0
00:00:06,800 --> 00:00:10,102
For the past five years, Spark
has been on an absolute tear,

1
00:00:10,102 --> 00:00:13,700
becoming one of the most widely
used technologies in big data

2
00:00:13,700 --> 00:00:17,226
and AI. Today's cutting-edge
companies like Facebook, Apple,

3
00:00:17,226 --> 00:00:18,300
Netflix, Uber,

4
00:00:18,300 --> 00:00:19,965
and many more have deployed

5
00:00:19,965 --> 00:00:23,366
Spark at massive scale,
processing petabytes of data

6
00:00:23,366 --> 00:00:25,192
to deliver innovations ranging

7
00:00:25,192 --> 00:00:27,212
from detecting
fraudulent behavior

8
00:00:27,212 --> 00:00:30,103
to delivering personalized
experiences in real

9
00:00:30,103 --> 00:00:32,741
time, and many such
innovations that are

10
00:00:32,741 --> 00:00:34,500
transforming every industry.

11
00:00:34,800 --> 00:00:37,300
Hi all, I welcome you
all to this full-course session

12
00:00:37,300 --> 00:00:40,408
on Apache Spark, a complete
crash course consisting

13
00:00:40,408 --> 00:00:43,200
of everything you need
to know to get started

14
00:00:43,200 --> 00:00:45,500
with Apache Spark from scratch.

15
00:00:45,700 --> 00:00:47,410
But before we get into details,

16
00:00:47,410 --> 00:00:51,000
let's look at our agenda for
today. For better understanding

17
00:00:51,000 --> 00:00:52,300
and ease of learning,

18
00:00:52,300 --> 00:00:55,400
the entire crash course
is divided into 12 modules.

19
00:00:55,400 --> 00:00:59,200
In the first module, Introduction
to Spark, we'll try to understand

20
00:00:59,200 --> 00:01:03,100
what exactly Spark is and how it
performs real-time processing.

21
00:01:03,200 --> 00:01:06,741
In the second module, we'll dive deep
into the different components

22
00:01:06,741 --> 00:01:10,600
that constitute Spark. We'll also
learn about the Spark architecture

23
00:01:10,600 --> 00:01:13,800
and its ecosystem. Next up,
in the third module,

24
00:01:13,800 --> 00:01:15,594
we will learn what exactly

25
00:01:15,594 --> 00:01:18,700
resilient distributed
datasets are in Spark.

26
00:01:19,100 --> 00:01:22,427
The fourth module is all about
DataFrames. In this module,

27
00:01:22,427 --> 00:01:25,000
we will learn what
exactly DataFrames are

28
00:01:25,000 --> 00:01:28,300
and how to perform different
operations on DataFrames.

29
00:01:28,400 --> 00:01:29,940
Moving on, in the fifth

30
00:01:29,940 --> 00:01:32,446
module, we will
discuss different ways

31
00:01:32,446 --> 00:01:35,300
that Spark provides
to perform SQL queries

32
00:01:35,300 --> 00:01:39,000
for accessing and processing
data. In the sixth module,

33
00:01:39,000 --> 00:01:39,847
we will learn

34
00:01:39,847 --> 00:01:43,500
how to perform streaming
on live data streams using Spark

35
00:01:43,500 --> 00:01:46,029
Streaming, and in the seventh
module, we'll discuss

36
00:01:46,029 --> 00:01:49,200
how to execute different machine
learning algorithms using

37
00:01:49,200 --> 00:01:52,469
Spark's machine learning library.
The eighth module is all

38
00:01:52,469 --> 00:01:54,917
about Spark GraphX.
In this module,

39
00:01:54,917 --> 00:01:57,800
we are going to learn what
graph processing is and

40
00:01:57,800 --> 00:02:01,700
how to perform graph processing
using the Spark GraphX library.

41
00:02:01,700 --> 00:02:05,500
In the ninth module, we'll discuss
the key differences between

42
00:02:05,500 --> 00:02:08,800
two popular data processing
paradigms, MapReduce

43
00:02:08,800 --> 00:02:12,500
and Spark. Talking
about the tenth module, we'll integrate

44
00:02:12,500 --> 00:02:14,400
two popular frameworks, Spark

45
00:02:14,400 --> 00:02:19,400
and Kafka. The eleventh module is
all about PySpark. In this module,

46
00:02:19,400 --> 00:02:21,000
we'll try to understand

47
00:02:21,000 --> 00:02:24,281
how PySpark exposes
the Spark programming model

48
00:02:24,281 --> 00:02:26,800
to Python. Lastly,
in the twelfth module,

49
00:02:26,800 --> 00:02:30,100
we'll take a look at the most
frequently asked interview

50
00:02:30,100 --> 00:02:31,200
questions on Spark,

51
00:02:31,200 --> 00:02:33,200
which will help you
ace your interview

52
00:02:33,200 --> 00:02:34,200
with flying colors.

53
00:02:34,200 --> 00:02:35,900
Thank you, guys.
While you are at it,

54
00:02:35,900 --> 00:02:37,600
please do not
forget to subscribe

55
00:02:37,600 --> 00:02:39,173
to the Edureka YouTube channel

56
00:02:39,173 --> 00:02:42,200
to stay updated with
current trending technologies.

57
00:02:47,200 --> 00:02:48,400
There has been a buzz

58
00:02:48,400 --> 00:02:51,576
around the world that Spark is
the future of the big data platform,

59
00:02:51,576 --> 00:02:53,400
which is a hundred times faster

60
00:02:53,400 --> 00:02:57,250
than MapReduce and is also
a go-to tool for all solutions.

61
00:02:57,250 --> 00:03:00,019
But what exactly is
Apache Spark, and what

62
00:03:00,019 --> 00:03:01,100
makes it so popular?

63
00:03:01,100 --> 00:03:03,700
In this session, I will give
you a complete insight

64
00:03:03,700 --> 00:03:04,600
into Apache Spark

65
00:03:04,600 --> 00:03:07,500
and its fundamentals.
Without any further ado,

66
00:03:07,500 --> 00:03:08,200
let's quickly

67
00:03:08,200 --> 00:03:09,898
look at the topics to be covered

68
00:03:09,898 --> 00:03:12,198
in this session.
First and foremost,

69
00:03:12,198 --> 00:03:13,000
I will tell you

70
00:03:13,000 --> 00:03:15,724
what Apache Spark is
and its features. Next,

71
00:03:15,724 --> 00:03:17,773
I will take you
to the components

72
00:03:17,773 --> 00:03:18,948
of the Spark ecosystem

73
00:03:18,948 --> 00:03:21,932
that make Spark the future
of the big data platform.

74
00:03:21,932 --> 00:03:22,600
After that,

75
00:03:22,600 --> 00:03:23,300
I will talk

76
00:03:23,300 --> 00:03:26,100
about the fundamental
data structure of Spark,

77
00:03:26,100 --> 00:03:28,400
that is, the RDD. I will also tell you

78
00:03:28,400 --> 00:03:32,400
about its features, its operations,
the ways to create RDDs, etc.

79
00:03:32,400 --> 00:03:35,500
And at last, I'll wrap
up the session by giving

80
00:03:35,500 --> 00:03:37,351
a real-time use case of Spark.

81
00:03:37,351 --> 00:03:38,505
So let's get started

82
00:03:38,505 --> 00:03:40,800
with the very first
topic and understand

83
00:03:40,800 --> 00:03:43,400
what Spark is. Spark
is an open-source,

84
00:03:43,400 --> 00:03:45,100
scalable, massively parallel,

85
00:03:45,100 --> 00:03:47,700
in-memory execution
environment for running

86
00:03:47,700 --> 00:03:49,300
analytics applications.

87
00:03:49,300 --> 00:03:52,085
You can just think
of it as an in-memory layer

88
00:03:52,085 --> 00:03:54,507
that sits above
multiple data stores,

89
00:03:54,507 --> 00:03:56,929
where data can be loaded
into the memory

90
00:03:56,929 --> 00:03:59,600
and analyzed in parallel
across the cluster.

91
00:03:59,800 --> 00:04:03,189
Coming to big data processing, much
like MapReduce, Spark works

92
00:04:03,189 --> 00:04:05,700
to distribute the data
across the cluster

93
00:04:05,700 --> 00:04:08,118
and then process
that data in parallel.

94
00:04:08,118 --> 00:04:10,833
The difference here is
that, unlike MapReduce,

95
00:04:10,833 --> 00:04:14,867
which shuffles files around
on disk, Spark works in memory,

96
00:04:14,867 --> 00:04:17,600
and that makes it much
faster at processing

97
00:04:17,600 --> 00:04:19,300
the data than MapReduce.

98
00:04:19,300 --> 00:04:20,663
It is also said to be

99
00:04:20,663 --> 00:04:24,235
a lightning-fast unified
analytics engine for big data

100
00:04:24,235 --> 00:04:25,600
and machine learning.

101
00:04:25,600 --> 00:04:28,680
So now let's look
at the interesting features

102
00:04:28,680 --> 00:04:29,800
of Apache Spark.

103
00:04:29,800 --> 00:04:32,181
Coming to speed, you
can call Spark

104
00:04:32,181 --> 00:04:34,100
a swift processing framework.

105
00:04:34,100 --> 00:04:37,500
Why? Because it is up to
a hundred times faster in memory

106
00:04:37,500 --> 00:04:40,900
and ten times faster on disk
when compared with Hadoop.

107
00:04:40,900 --> 00:04:41,700
Not only

108
00:04:41,700 --> 00:04:45,100
that, it also provides
high data processing speed.

109
00:04:45,200 --> 00:04:46,900
Next, powerful caching.

110
00:04:46,900 --> 00:04:48,809
It has a simple
programming layer

111
00:04:48,809 --> 00:04:50,600
that provides powerful caching

112
00:04:50,600 --> 00:04:53,341
and disk persistence
capabilities. And Spark

113
00:04:53,341 --> 00:04:55,300
can be deployed through Mesos,

114
00:04:55,300 --> 00:04:58,600
Hadoop via YARN, or Spark's
own cluster manager.

115
00:04:58,700 --> 00:04:59,700
As you all know,

116
00:04:59,700 --> 00:05:01,370
Spark itself was designed

117
00:05:01,370 --> 00:05:03,900
and developed for
real-time data processing.

118
00:05:03,900 --> 00:05:05,239
So it's an obvious fact

119
00:05:05,239 --> 00:05:07,584
that it offers
real-time computation

120
00:05:07,584 --> 00:05:10,800
and low latency because of
in-memory computation.

121
00:05:10,900 --> 00:05:14,700
Next, polyglot. Spark
provides high-level APIs

122
00:05:14,700 --> 00:05:16,700
in Java, Scala, Python,

123
00:05:16,700 --> 00:05:19,536
and R. Spark code
can be written in any

124
00:05:19,536 --> 00:05:21,281
of these four languages.

125
00:05:21,281 --> 00:05:25,500
Not only that, it also provides
a shell in Scala and Python.

126
00:05:25,692 --> 00:05:29,000
These are the various
features of Spark. Now,

127
00:05:29,000 --> 00:05:32,700
let's see the various
components of the Spark ecosystem.

128
00:05:32,700 --> 00:05:36,100
Let me first tell you
about the Spark Core component.

129
00:05:36,100 --> 00:05:39,385
It is the most vital component
of the Spark ecosystem,

130
00:05:39,385 --> 00:05:40,700
which is responsible

131
00:05:40,700 --> 00:05:44,400
for basic I/O functions,
scheduling, monitoring, etc.

132
00:05:44,400 --> 00:05:47,800
The entire Apache Spark
ecosystem is built on top

133
00:05:47,800 --> 00:05:49,670
of this core execution engine,

134
00:05:49,670 --> 00:05:52,700
which has extensible APIs
in different languages

135
00:05:52,700 --> 00:05:55,100
like Scala, Python, R, and Java.

136
00:05:55,100 --> 00:05:57,442
As I have already
mentioned, Spark

137
00:05:57,442 --> 00:05:59,200
can be deployed through Mesos,

138
00:05:59,200 --> 00:06:02,800
Hadoop via YARN,
or Spark's own cluster manager.

139
00:06:02,800 --> 00:06:05,433
The Spark ecosystem
library is composed

140
00:06:05,433 --> 00:06:06,888
of various components

141
00:06:06,888 --> 00:06:10,700
like Spark SQL, Spark Streaming,
and the machine learning library.

142
00:06:10,700 --> 00:06:13,200
Now, let me explain
each of them to you.

143
00:06:13,200 --> 00:06:16,573
The Spark SQL component
is used to leverage the power

144
00:06:16,573 --> 00:06:18,000
of declarative queries

145
00:06:18,000 --> 00:06:21,034
and optimized storage
by executing SQL queries

146
00:06:21,034 --> 00:06:22,000
on Spark data,

147
00:06:22,000 --> 00:06:23,778
which is present in RDDs

148
00:06:23,778 --> 00:06:27,100
and other external sources.
Next, the Spark Streaming

149
00:06:27,100 --> 00:06:29,617
component allows developers
to perform batch

150
00:06:29,617 --> 00:06:31,395
processing and streaming of data

151
00:06:31,395 --> 00:06:35,042
in the same application. Coming
to the machine learning library,

152
00:06:35,042 --> 00:06:36,313
it eases the deployment

153
00:06:36,313 --> 00:06:39,300
and development of scalable
machine learning pipelines,

154
00:06:39,300 --> 00:06:43,000
like summary statistics,
correlations, feature extraction,

155
00:06:43,000 --> 00:06:46,200
transformation functions,
optimization algorithms, etc.

156
00:06:46,200 --> 00:06:49,365
And the GraphX component lets
data scientists work

157
00:06:49,365 --> 00:06:52,584
with graph and non-graph sources
to achieve flexibility

158
00:06:52,584 --> 00:06:55,820
and resilience in graph
construction and transformation.

159
00:06:55,820 --> 00:06:56,784
And now, talking

160
00:06:56,784 --> 00:07:00,000
about the programming
languages Spark supports: Scala

161
00:07:00,000 --> 00:07:02,851
is a functional
programming language in which

162
00:07:02,851 --> 00:07:04,100
Spark is written.

163
00:07:04,100 --> 00:07:08,200
So Spark supports Scala as
an interface. Then, Spark also

164
00:07:08,200 --> 00:07:10,100
supports a Python interface.

165
00:07:10,100 --> 00:07:13,066
You can write the program
in Python and execute it

166
00:07:13,066 --> 00:07:14,408
on Spark. Again,

167
00:07:14,408 --> 00:07:16,899
if you see the code
in Python and Scala,

168
00:07:16,899 --> 00:07:20,858
both are very similar. Then, R
is very popular for data analysis

169
00:07:20,858 --> 00:07:22,200
and machine learning.

170
00:07:22,200 --> 00:07:25,081
So Spark has also added
support for R,

171
00:07:25,081 --> 00:07:26,717
and it also supports Java,

172
00:07:26,717 --> 00:07:27,961
so you can go ahead

173
00:07:27,961 --> 00:07:31,300
and write the code in Java
and execute it on Spark.

174
00:07:31,300 --> 00:07:33,300
Next, the data can be stored

175
00:07:33,300 --> 00:07:36,400
in HDFS, the local file
system, or the Amazon S3 cloud,

176
00:07:36,700 --> 00:07:39,700
and it also supports SQL
and NoSQL databases as well.

177
00:07:39,700 --> 00:07:43,645
So this is all about the various
components of the Spark ecosystem.

178
00:07:43,645 --> 00:07:45,300
Now, let's see what's next.

179
00:07:45,300 --> 00:07:48,064
When it comes to iterative
distributed computing,

180
00:07:48,064 --> 00:07:50,600
that is, processing the data
over multiple jobs

181
00:07:50,600 --> 00:07:51,600
and computations,

182
00:07:51,700 --> 00:07:52,776
we need to reuse

183
00:07:52,776 --> 00:07:55,200
or share the data
among multiple jobs.

184
00:07:55,200 --> 00:07:58,258
In earlier frameworks
like Hadoop, there were problems

185
00:07:58,258 --> 00:07:59,950
while dealing with multiple

186
00:07:59,950 --> 00:08:01,400
operations or jobs. Here,

187
00:08:01,400 --> 00:08:02,900
we needed to store the data

188
00:08:02,900 --> 00:08:07,053
in some intermediate stable
distributed storage such as HDFS,

189
00:08:07,053 --> 00:08:11,003
and multiple I/O operations
made the overall computation

190
00:08:11,003 --> 00:08:13,976
of jobs much slower,
and there were replications

191
00:08:13,976 --> 00:08:15,100
and serializations,

192
00:08:15,100 --> 00:08:17,955
which in turn made
the process even slower.

193
00:08:17,955 --> 00:08:20,500
Our goal here was
to reduce the number

194
00:08:20,500 --> 00:08:22,400
of I/O operations to HDFS,

195
00:08:22,400 --> 00:08:26,350
and this can be achieved only
through in-memory data sharing.

196
00:08:26,350 --> 00:08:29,900
In-memory data sharing
is 10 to 100 times faster

197
00:08:29,900 --> 00:08:31,966
than network and disk sharing,

198
00:08:31,966 --> 00:08:35,138
and RDDs try to solve all
the problems by enabling

199
00:08:35,138 --> 00:08:38,447
fault-tolerant distributed
in-memory computations.

200
00:08:38,447 --> 00:08:40,000
So now let's understand

201
00:08:40,000 --> 00:08:44,000
what RDDs are. RDD stands for
resilient distributed dataset.

202
00:08:44,000 --> 00:08:46,509
They are considered to be
the backbone of Spark

203
00:08:46,509 --> 00:08:49,419
and are one of the fundamental
data structures of Spark.

204
00:08:49,419 --> 00:08:51,782
They are also known as
schema-less structures

205
00:08:51,782 --> 00:08:54,900
that can handle both structured
and unstructured data.

206
00:08:54,900 --> 00:08:57,900
So in Spark, anything
you do revolves around RDDs.

207
00:08:57,900 --> 00:08:59,700
When you're reading
data in Spark,

208
00:08:59,700 --> 00:09:01,500
it is read
into an RDD. Again,

209
00:09:01,500 --> 00:09:04,300
when you're transforming
the data, you're performing

210
00:09:04,300 --> 00:09:07,268
transformations on an old RDD
and creating a new one.

211
00:09:07,268 --> 00:09:10,378
Then at last you will perform
some actions on the RDD

212
00:09:10,378 --> 00:09:12,533
and store the data
present in the RDD

213
00:09:12,533 --> 00:09:15,906
to persistent storage.
A resilient distributed dataset

214
00:09:15,906 --> 00:09:18,900
is an immutable distributed
collection of objects.

215
00:09:18,900 --> 00:09:20,300
Your objects can be anything

216
00:09:20,300 --> 00:09:23,200
like strings, lines,
rows, objects, collections,

217
00:09:23,200 --> 00:09:26,400
etc. RDDs can contain
any type of Python, Java,

218
00:09:26,400 --> 00:09:27,533
or Scala objects,

219
00:09:27,533 --> 00:09:30,000
even including
user-defined classes as well.

220
00:09:30,000 --> 00:09:32,900
And talking about
the distributed environment.

221
00:09:32,900 --> 00:09:35,612
Each dataset present
in an RDD is divided

222
00:09:35,612 --> 00:09:37,200
into logical partitions,

223
00:09:37,200 --> 00:09:39,353
which may be computed
on different nodes

224
00:09:39,353 --> 00:09:42,500
of the cluster. Due to this, you
can perform transformations

225
00:09:42,500 --> 00:09:44,190
or actions on the complete data

226
00:09:44,190 --> 00:09:47,300
in parallel, and you don't have
to worry about the distribution

227
00:09:47,300 --> 00:09:49,400
because Spark takes care of that.

228
00:09:49,400 --> 00:09:52,100
RDDs are
highly resilient; that is,

229
00:09:52,100 --> 00:09:55,141
they are able to recover
quickly from any issues

230
00:09:55,141 --> 00:09:56,500
as the same data chunks

231
00:09:56,500 --> 00:09:59,700
are replicated across
multiple executor nodes.

232
00:09:59,700 --> 00:10:02,564
So even if one executor
fails, another will still

233
00:10:02,564 --> 00:10:03,600
process the data.

234
00:10:03,600 --> 00:10:06,482
This allows you to perform
functional calculations

235
00:10:06,482 --> 00:10:08,287
against a data set very quickly

236
00:10:08,287 --> 00:10:10,699
by harnessing the power
of multiple nodes.

237
00:10:10,699 --> 00:10:12,472
So this is all about RDDs. Now,

238
00:10:12,472 --> 00:10:14,000
let's have a look at some

239
00:10:14,000 --> 00:10:17,847
of the important features of
RDDs. RDDs have a provision

240
00:10:17,847 --> 00:10:19,327
for in-memory computation,

241
00:10:19,327 --> 00:10:21,300
and all transformations
are lazy.

242
00:10:21,300 --> 00:10:24,044
That is, they do not compute
the results right away

243
00:10:24,044 --> 00:10:25,679
until an action is applied.

244
00:10:25,679 --> 00:10:27,800
So it supports
in-memory computation

245
00:10:27,800 --> 00:10:30,034
and lazy evaluation
as well. Next,

246
00:10:30,034 --> 00:10:32,200
fault tolerance. In the case of RDDs,

247
00:10:32,200 --> 00:10:34,454
they track the data
lineage information

248
00:10:34,454 --> 00:10:37,341
to rebuild lost data
automatically, and this is

249
00:10:37,341 --> 00:10:40,000
how it provides fault tolerance
to the system.
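
The lineage idea above can be inspected directly. This is a minimal sketch, assuming it is pasted into spark-shell, where `sc` (the SparkContext) is already defined; the variable names are illustrative and not from the session.

```scala
// Sketch for spark-shell, where `sc` is predefined.
// `numbers`, `doubled` and `bigOnes` are hypothetical names.
val numbers = sc.parallelize(1 to 10)   // base RDD
val doubled = numbers.map(_ * 2)        // derived from `numbers`
val bigOnes = doubled.filter(_ > 10)    // derived from `doubled`

// toDebugString prints the chain of parent RDDs (the lineage) that
// Spark can replay to recompute a lost partition.
println(bigOnes.toDebugString)
```

The printed text lists the filter, map and parallelize steps, which is the record Spark keeps instead of copying the data.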

250
00:10:40,000 --> 00:10:42,600
Next, immutability. Data
can be created

251
00:10:42,600 --> 00:10:43,800
or retrieved any time,

252
00:10:43,800 --> 00:10:46,388
and once defined, its value
cannot be changed.

253
00:10:46,388 --> 00:10:47,900
And that is the reason why

254
00:10:47,900 --> 00:10:51,235
I said RDDs are
immutable. Next, partitioning.

255
00:10:51,235 --> 00:10:53,774
It is the fundamental
unit of parallelism

256
00:10:53,774 --> 00:10:54,605
in a Spark RDD,

257
00:10:54,605 --> 00:10:57,800
and all the data chunks
are divided into partitions

258
00:10:57,800 --> 00:10:59,960
in an RDD. Next, persistence.

259
00:10:59,960 --> 00:11:01,600
Users can reuse RDDs

260
00:11:01,600 --> 00:11:05,400
and choose a storage strategy for
them. Coarse-grained operations

261
00:11:05,400 --> 00:11:08,493
apply to all elements
in datasets through map,

262
00:11:08,493 --> 00:11:10,600
filter, or
group-by operations.

263
00:11:10,700 --> 00:11:13,000
So these are the various
features of RDDs.
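
Two of the features listed above, partitioning and persistence, can be sketched in a few lines. This assumes spark-shell with `sc` predefined; the names `data` and `squares` and the choice of `MEMORY_AND_DISK` are illustrative assumptions, not taken from the session.

```scala
import org.apache.spark.storage.StorageLevel

// Sketch for spark-shell, where `sc` is predefined.
val data = sc.parallelize(1 to 100, 5)    // ask for 5 partitions
println(data.getNumPartitions)             // partitions backing the RDD

// Pick a storage strategy: keep partitions in memory, spill to disk if needed.
val squares = data.map(n => n * n).persist(StorageLevel.MEMORY_AND_DISK)

val howMany = squares.count()   // first action computes and stores the partitions
val total   = squares.sum()     // second action can reuse the stored partitions
squares.unpersist()             // release the storage when done
```

Other storage levels exist (for example memory only), so the level shown here is just one possible choice.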

264
00:11:13,300 --> 00:11:15,800
Now, let's see
the ways to create RDDs.

265
00:11:15,800 --> 00:11:19,117
There are three ways to create
RDDs. One can create an RDD

266
00:11:19,117 --> 00:11:22,800
from parallelized collections,
one can also create an RDD

267
00:11:22,800 --> 00:11:24,367
from an existing RDD

268
00:11:24,367 --> 00:11:27,100
or other RDDs,
and it can also be created

269
00:11:27,100 --> 00:11:30,000
from external data sources
as well, like HDFS,

270
00:11:30,000 --> 00:11:31,900
Amazon S3, HBase, etc.

271
00:11:32,000 --> 00:11:34,600
Now let me show you
how to create RDDs.

272
00:11:34,800 --> 00:11:37,199
I'll open my terminal
and first check

273
00:11:37,199 --> 00:11:39,600
whether my daemons
are running or not.

274
00:11:40,500 --> 00:11:41,300
Cool. Here,

275
00:11:41,300 --> 00:11:42,757
I can see that the Hadoop

276
00:11:42,757 --> 00:11:45,041
and Spark daemons
are both running.

277
00:11:45,041 --> 00:11:47,186
So now, first, let's start

278
00:11:47,186 --> 00:11:51,200
the Spark shell. It will take
a bit of time to start the shell.

279
00:11:52,500 --> 00:11:52,900
Cool.

280
00:11:52,900 --> 00:11:54,800
Now the Spark shell has started,

281
00:11:54,800 --> 00:11:58,329
and I can see the version of
Spark as 2.1.1,

282
00:11:58,329 --> 00:12:00,500
and we have a Scala
shell over here.

283
00:12:00,500 --> 00:12:00,759
Now.

284
00:12:00,759 --> 00:12:02,888
I will tell you
how to create rdds

285
00:12:02,888 --> 00:12:06,557
in three different ways using
the Scala language. First,

286
00:12:06,557 --> 00:12:08,450
let's see how to create an RDD

287
00:12:08,450 --> 00:12:12,178
from parallelized collections.
sc.parallelize is the method

288
00:12:12,178 --> 00:12:15,600
that I use to create a parallelized
collection RDD,

289
00:12:15,600 --> 00:12:16,733
and this method is

290
00:12:16,733 --> 00:12:20,700
the SparkContext's parallelize method
to create a parallelized collection.

291
00:12:20,700 --> 00:12:22,500
So I will type sc.parallelize,

292
00:12:22,500 --> 00:12:26,200
and here I will parallelize
the numbers 1 to 100

293
00:12:27,300 --> 00:12:31,371
in five different partitions,
and I will apply collect

294
00:12:31,371 --> 00:12:33,500
as the action to start the process.

295
00:12:34,900 --> 00:12:36,592
So here in the result,

296
00:12:36,592 --> 00:12:39,600
you can see an array
of the numbers 1 to 100.
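
The step just described could look roughly like this. It is a sketch for spark-shell, where `sc` is predefined; the exact command typed on screen is not captured in the captions, and the names `rdd` and `result` are assumptions.

```scala
// Sketch for spark-shell, where `sc` (the SparkContext) is predefined.
// Distribute the numbers 1 to 100 across five partitions.
val rdd = sc.parallelize(1 to 100, 5)

// collect is an action: it runs the job and returns the elements
// to the driver as a local Array[Int].
val result = rdd.collect()
println(result.mkString(", "))
```

Because collect brings every element back to the driver, it suits small datasets like this one.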

297
00:12:39,600 --> 00:12:40,100
Okay.

298
00:12:40,300 --> 00:12:41,635
Now let me show you

299
00:12:41,635 --> 00:12:45,010
how the partitions appear
in the web UI of Spark.

300
00:12:45,010 --> 00:12:49,300
So the web UI for Spark is at
localhost:4040.

301
00:12:50,700 --> 00:12:53,630
So here you have just
completed one task.

302
00:12:53,630 --> 00:12:55,903
That is, sc.parallelize
followed by collect.

303
00:12:55,903 --> 00:12:56,800
Correct? Here,

304
00:12:56,800 --> 00:13:00,114
you can see all five tasks
that succeeded,

305
00:13:00,114 --> 00:13:03,700
because we have divided the data
into five partitions.

306
00:13:03,700 --> 00:13:06,000
So let me show you the partitions.

307
00:13:06,000 --> 00:13:08,100
So this is the DAG
visualization,

308
00:13:08,100 --> 00:13:11,558
that is, the directed acyclic
graph visualization, wherein

309
00:13:11,558 --> 00:13:14,200
you have applied only
parallelize as a method,

310
00:13:14,200 --> 00:13:16,200
so you can see only
one stage here.

311
00:13:16,800 --> 00:13:20,291
So here you can see the rdd
that has been created,

312
00:13:20,291 --> 00:13:24,032
and coming to the event timeline,
you can see the tasks

313
00:13:24,032 --> 00:13:27,400
that have been executed
for the five different partitions,

314
00:13:27,400 --> 00:13:29,011
and the different colors indicate

315
00:13:29,011 --> 00:13:30,632
the scheduler delay, task

316
00:13:30,632 --> 00:13:34,300
deserialization time, shuffle
read time, shuffle write time,

317
00:13:34,300 --> 00:13:36,612
executor computing
time, etc. Here,

318
00:13:36,612 --> 00:13:40,227
you can see the summary metrics
for the created RDD. Here,

319
00:13:40,227 --> 00:13:41,000
you can see

320
00:13:41,000 --> 00:13:44,300
that the maximum time it
took to execute the tasks

321
00:13:44,300 --> 00:13:48,400
in five partitions in parallel is
just 45 milliseconds.

322
00:13:49,000 --> 00:13:53,300
You can also see the executor ID,
the host ID, the status,

323
00:13:53,300 --> 00:13:56,800
that is, succeeded,
duration, launch time, etc.

324
00:13:57,000 --> 00:13:59,255
So this is one way
of creating an RDD

325
00:13:59,255 --> 00:14:01,061
from parallelized collections.

326
00:14:01,061 --> 00:14:02,400
Now, let me show you

327
00:14:02,400 --> 00:14:05,900
how to create an RDD
from an existing RDD. Okay,

328
00:14:06,000 --> 00:14:08,770
here I'll create
an array called a1

329
00:14:08,770 --> 00:14:11,077
and assign numbers one to ten.

330
00:14:11,800 --> 00:14:14,900
One two, three,
four five six seven.

331
00:14:16,200 --> 00:14:18,900
Okay, so I got the result here.

332
00:14:18,900 --> 00:14:22,300
That is I have created
an integer array of 1 to 10

333
00:14:22,300 --> 00:14:25,200
and now I will parallelize
this a1.

334
00:14:31,303 --> 00:14:32,996
Sorry, I got an error.

335
00:14:33,300 --> 00:14:37,300
It is sc.parallelize
of a1.

336
00:14:38,200 --> 00:14:42,800
Okay, so I created an RDD
called ParallelCollectionRDD. Cool.

337
00:14:42,800 --> 00:14:46,600
Now I will create a new RDD
from the existing RDD.

338
00:14:46,600 --> 00:14:51,000
That is, val newRDD is equal

339
00:14:51,000 --> 00:14:55,900
to a1.map over the data
present in an RDD.

340
00:14:56,061 --> 00:14:59,138
I will create a new RDD
from the existing RDD.

341
00:14:59,200 --> 00:15:01,200
So here I will take a1

342
00:15:01,200 --> 00:15:05,800
as a reference and map
the data and multiply

343
00:15:05,800 --> 00:15:07,300
that data by two.

344
00:15:07,573 --> 00:15:09,726
So what should be our output

345
00:15:10,019 --> 00:15:13,480
if I multiply the data present
in an RDD by two?

346
00:15:13,700 --> 00:15:18,600
It would be like
2, 4, 6, 8, up to 20, correct?

347
00:15:18,600 --> 00:15:20,400
So, let's see how it works.

348
00:15:20,700 --> 00:15:24,500
Yes, we got the output,
that is, 1 to 10 multiplied by two.

349
00:15:24,500 --> 00:15:26,691
That is, 2, 4,
6, 8, up to 20.
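
A sketch of deriving a new RDD from an existing one, assuming spark-shell with `sc` predefined. The captions mention `a1` and `newRDD`; the name `baseRDD` is a hypothetical label for the parallelized collection, added because calling `map` on the plain array would return a local Array rather than an RDD.

```scala
// Sketch for spark-shell, where `sc` is predefined.
val a1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   // local array, as in the demo

// `baseRDD` is an assumed name for the RDD built from the array.
val baseRDD = sc.parallelize(a1)

// map is a transformation: it describes a new RDD and leaves baseRDD unchanged.
val newRDD = baseRDD.map(data => data * 2)

// collect is the action that actually runs the computation.
val doubled = newRDD.collect()
println(doubled.mkString(", "))
```

Mapping over `baseRDD` keeps the work distributed across the cluster and leaves the original RDD available for other derivations.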

350
00:15:26,691 --> 00:15:28,357
So this is one of the methods

351
00:15:28,357 --> 00:15:30,500
of creating a new RDD
from an old RDD.

352
00:15:30,500 --> 00:15:34,088
And I have one more method that
is from external file sources.

353
00:15:34,088 --> 00:15:37,500
So what I will do here is I
will give val test is equal

354
00:15:37,500 --> 00:15:39,780
to sc.textFile. Here,

355
00:15:40,790 --> 00:15:43,800
I will give the path
to the HDFS file location

356
00:15:43,800 --> 00:15:48,900
and link the path, that is,
hdfs://localhost:9000 is the path,

357
00:15:48,900 --> 00:15:50,800
and I have a folder

358
00:15:50,800 --> 00:15:54,600
called example, and in that
I have a file called sample.

359
00:15:57,300 --> 00:16:01,500
Cool, so I got one
more RDD created here.

360
00:16:02,000 --> 00:16:02,281
Now.

361
00:16:02,281 --> 00:16:04,042
Let me show you this file

362
00:16:04,042 --> 00:16:07,000
that I have already kept
in the HDFS directory.

363
00:16:08,100 --> 00:16:09,897
I will browse the file system

364
00:16:09,897 --> 00:16:12,500
and I will show you
the / example directory

365
00:16:12,500 --> 00:16:13,800
that I have created.

366
00:16:14,800 --> 00:16:16,867
So here you can see the example

367
00:16:16,867 --> 00:16:19,800
that I have created as
a directory and here I

368
00:16:19,800 --> 00:16:23,000
have sample as the input file
that I have given.

369
00:16:23,000 --> 00:16:25,800
So here you can see
the same path location.

370
00:16:25,800 --> 00:16:26,300
So this is

371
00:16:26,300 --> 00:16:29,633
how I can create an RDD
from external file sources.

372
00:16:29,633 --> 00:16:30,484
In this case.

373
00:16:30,484 --> 00:16:33,300
I have used HDFS as
an external file source.
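
Reading from an external source might be sketched as follows, assuming spark-shell with `sc` predefined and an HDFS NameNode reachable at the address quoted in the session. The path is assembled from the spoken pieces (hdfs://localhost:9000, folder example, file sample), so treat it as an assumption about this particular setup.

```scala
// Sketch for spark-shell, where `sc` is predefined.
// The URI is pieced together from the walkthrough; adjust it to your own cluster.
val test = sc.textFile("hdfs://localhost:9000/example/sample")

// textFile is lazy: the file is only read when an action runs.
// take(5) fetches the first few lines to the driver.
test.take(5).foreach(println)
```

Each element of the resulting RDD is one line of the file, so line-oriented transformations can follow directly.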

374
00:16:33,300 --> 00:16:36,757
So this is how we can create
RDDs in three different ways,

375
00:16:36,757 --> 00:16:39,700
that is, from parallelized collections,
from external data sources,

376
00:16:39,700 --> 00:16:41,600
and from existing RDDs.

377
00:16:41,700 --> 00:16:44,900
So let's move further and see
the various RDD operations.

378
00:16:44,900 --> 00:16:46,500
RDDs actually support

379
00:16:46,500 --> 00:16:50,100
two main operations, namely
transformations and actions.

380
00:16:50,100 --> 00:16:51,419
As I have already said,

381
00:16:51,419 --> 00:16:53,200
RDDs are immutable.

382
00:16:53,200 --> 00:16:54,900
So once you create an rdd,

383
00:16:54,900 --> 00:16:57,500
you cannot change
any content in the RDD.

384
00:16:57,500 --> 00:16:58,913
So you might be wondering

385
00:16:58,913 --> 00:17:01,400
how an RDD applies
those transformations.

386
00:17:01,400 --> 00:17:02,200
Correct?

387
00:17:02,200 --> 00:17:04,299
When you run
any Transformations,

388
00:17:04,299 --> 00:17:07,062
it runs those transformations
on the old RDD

389
00:17:07,062 --> 00:17:08,445
and creates a new RDD.

390
00:17:08,445 --> 00:17:11,400
This is basically done
for optimization reasons.

391
00:17:11,400 --> 00:17:13,446
Transformations are
the operations

392
00:17:13,446 --> 00:17:14,500
which are applied

393
00:17:14,500 --> 00:17:18,815
on an RDD to create a new RDD.
Now, these transformations work

394
00:17:18,815 --> 00:17:21,221
on the principle
of lazy evaluations.

395
00:17:21,221 --> 00:17:23,075
So what does it mean? It means

396
00:17:23,075 --> 00:17:25,500
that when we call
some operation on an RDD,

397
00:17:25,500 --> 00:17:28,888
it does not execute immediately,
and Spark maintains

398
00:17:28,888 --> 00:17:31,704
a record of the operation
that is being called.

399
00:17:31,704 --> 00:17:34,127
Since transformations
are lazy in nature,

400
00:17:34,127 --> 00:17:36,052
we can execute the operation

401
00:17:36,052 --> 00:17:38,600
any time by calling
an action on the data.

402
00:17:38,800 --> 00:17:42,200
Hence in lazy evaluation
data is not loaded

403
00:17:42,200 --> 00:17:44,525
until it is necessary now these

404
00:17:44,525 --> 00:17:46,100
Since analyze the RTD

405
00:17:46,100 --> 00:17:49,103
and produce a result.
A simple action can be count,

406
00:17:49,103 --> 00:17:52,800
which will count the rows in
the RDD and then produce a result.

407
00:17:52,800 --> 00:17:53,583
So I can say

408
00:17:53,583 --> 00:17:57,700
that transformations produce new
RDDs and actions produce results.
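The difference between lazy transformations and actions can be sketched in a few lines of plain Python. This is only a toy model, not Spark's API: the class name LazyRDD and the helper double are made up for illustration, and the real RDD is distributed across a cluster rather than held in one tuple.

```python
class LazyRDD:
    """Toy stand-in for an RDD: immutable, records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = tuple(source)   # never modified after creation
        self._ops = tuple(ops)         # recorded transformations, in order

    def map(self, fn):
        # Transformation: returns a NEW LazyRDD; nothing is computed yet.
        return LazyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new object.
        return LazyRDD(self._source, self._ops + (("filter", pred),))

    def _compute(self):
        # Replay every recorded transformation over the source data.
        data = iter(self._source)
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def count(self):
        # Action: triggers evaluation and returns a number.
        return sum(1 for _ in self._compute())

    def collect(self):
        # Action: triggers evaluation and returns the elements as a list.
        return list(self._compute())


calls = []

def double(x):
    calls.append(x)   # remember that the function really ran
    return x * 2

base = LazyRDD(range(1, 6))
doubled = base.map(double)              # lazy: double() has not run yet
big = doubled.filter(lambda x: x > 4)   # still lazy, and base is unchanged
```

Until count or collect is called on big, the list calls stays empty, which is the behavior the lazy-evaluation explanation describes.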

409
00:17:57,700 --> 00:18:00,058
Before moving further
with the discussion,

410
00:18:00,058 --> 00:18:03,000
let me tell you about
the three different workloads

411
00:18:03,000 --> 00:18:06,500
that Spark caters to. They are
batch mode, interactive mode,

412
00:18:06,500 --> 00:18:09,052
and streaming mode.
In the case of batch mode,

413
00:18:09,052 --> 00:18:10,839
we run a batch job:
you write a job

414
00:18:10,839 --> 00:18:13,427
and then schedule it.
It works through a queue

415
00:18:13,427 --> 00:18:14,703
or a batch of separate

416
00:18:14,703 --> 00:18:17,292
jobs without manual
intervention. Then, in the case

417
00:18:17,292 --> 00:18:18,400
of interactive mode,

418
00:18:18,400 --> 00:18:19,700
it is an interactive shell

419
00:18:19,700 --> 00:18:22,100
where you go and execute
the commands one by one.

420
00:18:22,300 --> 00:18:24,844
So you will execute
one command, check the result,

421
00:18:24,844 --> 00:18:26,902
and then execute
another command based

422
00:18:26,902 --> 00:18:28,400
on the output result and so

423
00:18:28,400 --> 00:18:30,754
on. It works similarly
to the SQL shell.

424
00:18:30,754 --> 00:18:32,100
So the shell is the one

425
00:18:32,100 --> 00:18:35,221
which executes a driver program,
and in shell mode

426
00:18:35,221 --> 00:18:37,096
you can run it
on the cluster.

427
00:18:37,096 --> 00:18:39,449
It is generally used
for development work

428
00:18:39,449 --> 00:18:41,159
or it is used
for ad hoc queries.

429
00:18:41,159 --> 00:18:42,708
Then comes the streaming mode,

430
00:18:42,708 --> 00:18:44,900
where the program
is continuously running.

431
00:18:44,900 --> 00:18:47,300
As and when data
comes, it takes the data,

432
00:18:47,300 --> 00:18:48,818
does some transformations

433
00:18:48,818 --> 00:18:51,300
and actions on the data,
and gets some results.

434
00:18:51,300 --> 00:18:53,800
So these are the three
different workloads

435
00:18:53,800 --> 00:18:55,600
that Spark caters to. Now,

436
00:18:55,600 --> 00:18:58,100
let's see a real-time
use case. Here,

437
00:18:58,100 --> 00:18:59,600
I'm considering Yahoo!

438
00:18:59,600 --> 00:19:00,600
as an example.

439
00:19:00,600 --> 00:19:02,716
So what are
the problems of Yahoo?

440
00:19:02,716 --> 00:19:03,128
Yahoo!

441
00:19:03,128 --> 00:19:04,062
properties are

442
00:19:04,062 --> 00:19:06,800
highly personalized
to maximize relevance.

443
00:19:06,800 --> 00:19:09,600
The algorithms used
to provide personalization,

444
00:19:09,600 --> 00:19:11,692
that is, the
targeted advertisement

445
00:19:11,692 --> 00:19:14,800
and personalized content,
are highly sophisticated,

446
00:19:14,800 --> 00:19:18,300
and the relevance model
must be updated frequently

447
00:19:18,300 --> 00:19:22,745
because stories, news feeds, and
ads change over time. And Yahoo

448
00:19:22,745 --> 00:19:24,967
has over 150 petabytes of data

449
00:19:24,967 --> 00:19:28,300
that is stored
on a 35,000-node Hadoop cluster,

450
00:19:28,300 --> 00:19:31,391
which should be accessed
efficiently to avoid latency

451
00:19:31,391 --> 00:19:33,150
caused by the data movement

452
00:19:33,150 --> 00:19:35,300
and to gain insights
from the data

453
00:19:35,300 --> 00:19:37,000
in a cost-effective manner.

454
00:19:37,000 --> 00:19:39,600
So to overcome
these problems, Yahoo

455
00:19:39,600 --> 00:19:42,171
looked to Spark to
improve the performance

456
00:19:42,171 --> 00:19:44,687
of this iterative
model training. Here, the

457
00:19:44,687 --> 00:19:48,700
machine learning algorithm for
news personalization required

458
00:19:48,700 --> 00:19:51,200
15,000 lines of C++ code.

459
00:19:51,300 --> 00:19:55,000
On the other hand, the machine
learning algorithm has just

460
00:19:55,000 --> 00:19:57,076
120 lines of Scala code.

461
00:19:57,100 --> 00:19:59,600
So that is
the advantage of Spark.

462
00:19:59,800 --> 00:20:02,600
And this algorithm was ready
for production use

463
00:20:02,600 --> 00:20:06,700
in just 30 minutes of training
on a hundred million records.

464
00:20:06,700 --> 00:20:08,900
Spark's rich API is available

465
00:20:08,900 --> 00:20:12,201
in several programming
languages, has resilient

466
00:20:12,201 --> 00:20:14,588
in-memory storage
options, and is

467
00:20:14,588 --> 00:20:18,567
compatible with Hadoop through YARN
and the Spark-YARN project.

468
00:20:18,567 --> 00:20:21,400
It uses Apache Spark
for personalizing its

469
00:20:21,400 --> 00:20:24,490
news web pages and for
targeted advertising.

470
00:20:24,490 --> 00:20:28,300
Not only that, it also uses
machine learning algorithms

471
00:20:28,300 --> 00:20:31,375
that run on Apache Spark
to find out what kind

472
00:20:31,375 --> 00:20:33,700
of news users are
interested in reading,

473
00:20:33,700 --> 00:20:36,714
and also for categorizing
the news stories to find

474
00:20:36,714 --> 00:20:39,290
out what kind of users
would be interested

475
00:20:39,290 --> 00:20:41,300
in reading each category of news.

476
00:20:41,524 --> 00:20:44,524
Spark runs over Hadoop YARN
to use existing data

477
00:20:44,600 --> 00:20:47,800
and clusters, and
the extensive API of Spark

478
00:20:47,800 --> 00:20:50,605
and its machine learning library
eases the development

479
00:20:50,605 --> 00:20:54,276
of machine learning algorithms.
And Spark reduces the latency

480
00:20:54,276 --> 00:20:55,400
of model training

481
00:20:55,400 --> 00:20:56,800
via in-memory RDDs.

482
00:20:56,800 --> 00:21:00,855
So this is how Spark has helped
Yahoo to improve the performance

483
00:21:00,855 --> 00:21:02,431
and achieve the targets.

484
00:21:02,431 --> 00:21:05,320
So I hope you understood
the concept of Spark

485
00:21:05,320 --> 00:21:06,700
and its fundamentals.

486
00:21:11,500 --> 00:21:14,000
Now, let me just give
you an overview

487
00:21:14,000 --> 00:21:17,600
of the Spark architecture.
Apache Spark has a well-defined

488
00:21:17,600 --> 00:21:18,711
layered architecture,

489
00:21:18,711 --> 00:21:22,017
where all the components
and layers are loosely coupled

490
00:21:22,017 --> 00:21:25,200
and integrated with various
extensions and libraries.

491
00:21:25,200 --> 00:21:28,600
This architecture is based
on two main abstractions.

492
00:21:28,600 --> 00:21:31,500
The first one is resilient
distributed datasets,

493
00:21:31,500 --> 00:21:32,419
that is, RDD,

494
00:21:32,419 --> 00:21:36,108
and the next one is the directed
acyclic graph, called DAG.

495
00:21:36,108 --> 00:21:40,100
In order to understand
the Spark architecture,

496
00:21:40,100 --> 00:21:43,400
you need to first know
the components of Spark,

497
00:21:43,400 --> 00:21:44,500
that is, the Spark

498
00:21:44,500 --> 00:21:47,700
ecosystem and its fundamental
data structure, RDD.

499
00:21:47,700 --> 00:21:51,100
So let's start by understanding
the Spark ecosystem.

500
00:21:51,100 --> 00:21:53,080
As you can see from the diagram,

501
00:21:53,080 --> 00:21:56,300
the Spark ecosystem is composed
of various components

502
00:21:56,300 --> 00:21:57,812
like Spark SQL, Spark

503
00:21:57,812 --> 00:22:01,400
Streaming, the machine learning
library, GraphX, Spark

504
00:22:01,400 --> 00:22:05,600
R, and the Core API component.
Talking about Spark SQL,

505
00:22:05,600 --> 00:22:08,700
it is used to leverage the power
of declarative queries

506
00:22:08,700 --> 00:22:11,827
and optimized storage
by executing SQL queries

507
00:22:11,827 --> 00:22:12,817
on Spark data,

508
00:22:12,817 --> 00:22:14,520
which is present in RDDs

509
00:22:14,520 --> 00:22:18,600
and other external sources.
Next, the Spark Streaming component

510
00:22:18,600 --> 00:22:21,400
allows developers
to perform batch processing

511
00:22:21,400 --> 00:22:22,600
and streaming of the data

512
00:22:22,600 --> 00:22:26,300
in the same application. Coming
to the machine learning library,

513
00:22:26,300 --> 00:22:27,745
it eases the development

514
00:22:27,745 --> 00:22:30,862
and deployment of scalable
machine learning pipelines,

515
00:22:30,862 --> 00:22:33,765
like summary statistics,
cluster analysis methods,

516
00:22:33,765 --> 00:22:36,709
correlations, dimensionality
reduction techniques,

517
00:22:36,709 --> 00:22:37,900
feature extraction,

518
00:22:37,900 --> 00:22:40,500
and many more. Now, the
GraphX component

519
00:22:40,500 --> 00:22:42,100
lets data scientists work

520
00:22:42,100 --> 00:22:44,689
with graph and non-graph
sources to achieve

521
00:22:44,689 --> 00:22:47,400
flexibility and resilience
in graph construction

522
00:22:47,400 --> 00:22:51,000
and transformation. Coming
to SparkR, it is an R package

523
00:22:51,000 --> 00:22:54,818
that provides a lightweight
front end to use Apache Spark.

524
00:22:54,818 --> 00:22:58,000
It provides a distributed
data frame implementation

525
00:22:58,000 --> 00:23:01,994
that supports operations like
selection, filtering, and aggregation,

526
00:23:01,994 --> 00:23:03,500
but on large datasets.

527
00:23:03,500 --> 00:23:06,198
It also supports
distributed machine learning

528
00:23:06,198 --> 00:23:08,100
using the machine learning library.

529
00:23:08,157 --> 00:23:10,542
Finally, the Spark Core component.

530
00:23:10,600 --> 00:23:13,600
It is the most vital component
of the Spark ecosystem,

531
00:23:13,600 --> 00:23:14,800
which is responsible

532
00:23:14,800 --> 00:23:17,621
for basic
I/O functions, scheduling,

533
00:23:17,621 --> 00:23:21,517
and monitoring. The entire Spark
ecosystem is built on top

534
00:23:21,517 --> 00:23:23,456
of this core execution engine,

535
00:23:23,456 --> 00:23:26,600
which has extensible APIs
in different languages

536
00:23:26,600 --> 00:23:29,400
like Scala, Python,
R, and Java. Now,

537
00:23:29,400 --> 00:23:32,200
let me tell you
about the programming languages

538
00:23:32,200 --> 00:23:33,977
that Spark supports. First,

539
00:23:33,977 --> 00:23:37,190
Scala. Scala is a functional
programming language

540
00:23:37,190 --> 00:23:38,900
in which Spark is written,

541
00:23:39,092 --> 00:23:42,400
and Spark supports Scala
as an interface. Then,

542
00:23:42,400 --> 00:23:44,400
Spark also supports a Python

543
00:23:44,400 --> 00:23:48,012
interface. You can write a program
in Python and execute it

544
00:23:48,012 --> 00:23:49,500
over Spark. Again,

545
00:23:49,500 --> 00:23:52,166
if you see the code
in Scala and Python,

546
00:23:52,166 --> 00:23:56,166
both are very similar. Then,
coming to R, it is very famous

547
00:23:56,166 --> 00:23:58,700
for data analysis
and machine learning.

548
00:23:58,700 --> 00:24:01,708
So Spark has also added
support for R,

549
00:24:01,708 --> 00:24:03,500
and it also supports Java,

550
00:24:03,500 --> 00:24:06,280
so you can go ahead
and write the Java code

551
00:24:06,280 --> 00:24:08,200
and execute it over Spark.

552
00:24:08,200 --> 00:24:11,100
Again, Spark also provides
you interactive shells

553
00:24:11,100 --> 00:24:14,005
for Scala, Python, and R,
where you can go ahead

554
00:24:14,005 --> 00:24:16,230
and execute the commands
one by one.

555
00:24:16,230 --> 00:24:18,700
So this is all about
the Spark ecosystem.

556
00:24:18,700 --> 00:24:19,500
Next.

557
00:24:19,500 --> 00:24:22,600
let's discuss the fundamental
data structure of Spark,

558
00:24:22,600 --> 00:24:26,400
that is, RDD, which stands for
resilient distributed dataset.

559
00:24:26,784 --> 00:24:30,015
So in Spark, anything
you do is around RDDs.

560
00:24:30,200 --> 00:24:33,200
When you're reading data
in Spark, it is read

561
00:24:33,200 --> 00:24:34,400
into an RDD. Again,

562
00:24:34,400 --> 00:24:37,200
when you're transforming
the data, you're performing

563
00:24:37,200 --> 00:24:40,509
transformations on an old RDD
and creating a new one.

564
00:24:40,509 --> 00:24:43,200
Then, at the end, you
will perform some actions

565
00:24:43,200 --> 00:24:44,643
on the data and store the

566
00:24:44,643 --> 00:24:46,288
dataset present in an RDD

567
00:24:46,288 --> 00:24:49,764
to persistent storage.
A resilient distributed

568
00:24:49,764 --> 00:24:53,300
dataset is an immutable distributed
collection of objects.

569
00:24:53,300 --> 00:24:55,200
The objects can be anything,

570
00:24:55,200 --> 00:24:58,910
like strings, lines,
rows, objects, collections, etc.

571
00:24:59,600 --> 00:25:02,704
Now, talking about
the distributed environment,

572
00:25:02,704 --> 00:25:06,500
each dataset in an RDD is divided
into logical partitions,

573
00:25:06,500 --> 00:25:08,709
which may be computed
on different nodes

574
00:25:08,709 --> 00:25:12,062
of the cluster. Due to this, you
can perform transformations

575
00:25:12,062 --> 00:25:14,416
and actions on the
complete data in parallel,

576
00:25:14,416 --> 00:25:17,100
and you don't have to worry
about the distribution

577
00:25:17,100 --> 00:25:18,700
because Spark takes care

578
00:25:18,700 --> 00:25:22,200
of that. Next, as I said,
RDDs are immutable.

579
00:25:22,200 --> 00:25:25,000
So once you create
an RDD, you cannot change

580
00:25:25,000 --> 00:25:26,500
any content in the RDD.

581
00:25:26,500 --> 00:25:28,102
So you might be wondering

582
00:25:28,102 --> 00:25:31,500
how an RDD applies
those transformations, correct?

583
00:25:31,600 --> 00:25:35,845
When you run any transformation,
it runs that transformation

584
00:25:35,845 --> 00:25:38,300
on the old RDD
and creates a new RDD.

585
00:25:38,300 --> 00:25:41,700
This is basically done
for optimization reasons.

586
00:25:41,700 --> 00:25:44,609
So, let me tell you
one thing here: RDDs can

587
00:25:44,609 --> 00:25:46,205
be cached and persisted.

588
00:25:46,205 --> 00:25:49,270
If you want to save an RDD
for future work,

589
00:25:49,270 --> 00:25:50,218
you can cache it,

590
00:25:50,218 --> 00:25:53,000
and it will improve
the Spark performance. An RDD

591
00:25:53,000 --> 00:25:55,589
is a fault-tolerant
collection of elements

592
00:25:55,589 --> 00:25:57,800
that can be operated
on in parallel.

593
00:25:57,800 --> 00:26:00,400
If an RDD is lost,
it will automatically

594
00:26:00,400 --> 00:26:03,400
be recomputed by using
the original Transformations.

595
00:26:03,500 --> 00:26:06,500
This is how Spark
provides fault tolerance.
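The recompute-from-transformations idea can be sketched in plain Python. This is a simplified model under stated assumptions: build_partition, source, and lineage are hypothetical names, and real Spark tracks this lineage per RDD and per partition inside its scheduler.

```python
def build_partition(source_rows, lineage):
    """Recompute one partition by replaying its recorded transformations."""
    rows = list(source_rows)
    for fn in lineage:
        rows = [fn(r) for r in rows]
    return rows

source = [1, 2, 3]                              # durable input for one partition
lineage = [lambda x: x + 1, lambda x: x * 10]   # transformations, in the order applied

cached = build_partition(source, lineage)       # normal computation
cached = None                                   # simulate losing the partition
recovered = build_partition(source, lineage)    # replay the lineage to rebuild it
```

Because the source data and the list of transformations are both still available, the lost result can be rebuilt without keeping a second copy of it.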

596
00:26:06,500 --> 00:26:10,300
There are two ways to create
RDDs: the first one is by parallelizing

597
00:26:10,300 --> 00:26:13,100
an existing collection
in your driver program,

598
00:26:13,100 --> 00:26:15,809
and the second one
is by referencing a dataset

599
00:26:15,809 --> 00:26:17,700
in an external storage system,

600
00:26:17,700 --> 00:26:21,200
such as a shared file
system, HDFS, HBase, etc.

601
00:26:21,400 --> 00:26:23,852
Now, transformations
are the operations

602
00:26:23,852 --> 00:26:27,300
that you perform on an RDD,
which will create a new RDD.

603
00:26:27,300 --> 00:26:30,346
For example, you
can perform a filter on an RDD

604
00:26:30,346 --> 00:26:31,800
and create a new RDD.

605
00:26:31,800 --> 00:26:34,577
Then there are actions,
which analyze the RDD

606
00:26:34,577 --> 00:26:37,717
and produce a result.
A simple action can be count,

607
00:26:37,717 --> 00:26:39,900
which will count
the rows in the RDD

608
00:26:39,900 --> 00:26:42,100
and produce a result. So I can say

609
00:26:42,100 --> 00:26:46,200
that transformations produce
new RDDs and actions produce results.

610
00:26:46,200 --> 00:26:47,011
So this is all

611
00:26:47,011 --> 00:26:49,600
about the fundamental
data structure of Spark,

612
00:26:49,600 --> 00:26:51,000
that is, RDD. Now,

613
00:26:51,000 --> 00:26:54,300
let's dive into the core topic
of today's discussion

614
00:26:54,300 --> 00:26:56,120
that is, the Spark architecture.

615
00:26:56,120 --> 00:26:58,100
So this is
the Spark architecture.

616
00:26:58,100 --> 00:26:59,300
In your master node,

617
00:26:59,300 --> 00:27:02,681
you have the driver program,
which drives your application.

618
00:27:02,681 --> 00:27:06,300
So the code that you're writing
behaves as a driver program, or

619
00:27:06,300 --> 00:27:08,752
if you are using
the interactive shell, the shell

620
00:27:08,752 --> 00:27:12,017
acts as the driver program.
Inside the driver program,

621
00:27:12,017 --> 00:27:12,900
the first thing

622
00:27:12,900 --> 00:27:16,134
that you do is you create
a Spark context. Assume

623
00:27:16,134 --> 00:27:19,300
that the Spark context
is a gateway to all Spark

624
00:27:19,300 --> 00:27:22,800
functionality. It is similar
to your database connection.

625
00:27:22,800 --> 00:27:25,800
So any command you execute
in a database goes

626
00:27:25,800 --> 00:27:29,600
through the database connection.
Similarly, anything you do

627
00:27:29,600 --> 00:27:32,600
on Spark goes through
the Spark context.

628
00:27:32,700 --> 00:27:34,800
Now, this Spark context works

629
00:27:34,800 --> 00:27:37,652
with the cluster manager
to manage various jobs.

630
00:27:37,652 --> 00:27:38,783
The driver program

631
00:27:38,783 --> 00:27:42,050
and the Spark context take care
of executing the job

632
00:27:42,050 --> 00:27:44,700
across the cluster.
A job is split into tasks,

633
00:27:45,161 --> 00:27:46,700
and then these tasks

634
00:27:46,700 --> 00:27:48,500
are distributed over
the worker nodes.

635
00:27:48,500 --> 00:27:50,417
So anytime you create an RDD

636
00:27:50,417 --> 00:27:53,562
in the Spark context,
that RDD can be distributed

637
00:27:53,562 --> 00:27:54,900
across various nodes

638
00:27:54,900 --> 00:27:58,711
and can be cached there. So an RDD
is said to be taken, partitioned,

639
00:27:58,711 --> 00:28:02,426
and distributed across various
nodes. Now, worker nodes are

640
00:28:02,426 --> 00:28:06,268
the slave nodes whose job is
to basically execute the tasks.

641
00:28:06,268 --> 00:28:07,895
The task is then executed

642
00:28:07,895 --> 00:28:10,500
on the partitioned RDDs
in the worker nodes

643
00:28:10,500 --> 00:28:14,327
and then returns the result back
to the Spark context. The Spark

644
00:28:14,327 --> 00:28:17,892
context takes the job, breaks
the job into tasks,

645
00:28:17,892 --> 00:28:20,400
and distributes them
to the worker nodes,

646
00:28:20,400 --> 00:28:23,900
and these tasks work
on the partitioned RDDs, perform

647
00:28:23,900 --> 00:28:26,252
whatever operations you
want to perform,

648
00:28:26,252 --> 00:28:27,800
and then collect the result

649
00:28:27,800 --> 00:28:30,300
and give it back
to the main Spark context.

650
00:28:30,300 --> 00:28:32,690
If you increase
the number of workers,

651
00:28:32,690 --> 00:28:34,199
then you can divide jobs

652
00:28:34,199 --> 00:28:38,100
into more partitions and execute
them in parallel over multiple systems.

653
00:28:38,100 --> 00:28:40,600
This will actually be
a lot faster.

654
00:28:40,600 --> 00:28:42,900
Also, if you increase
the number of workers,

655
00:28:42,900 --> 00:28:44,700
it will also
increase your memory,

656
00:28:44,900 --> 00:28:46,746
and you can cache the jobs

657
00:28:46,746 --> 00:28:49,800
so that they can be executed
much faster.

658
00:28:49,800 --> 00:28:52,231
So this is all
about Spark architecture.

659
00:28:52,231 --> 00:28:52,491
Now.

660
00:28:52,491 --> 00:28:54,709
Let me give you
an infographic idea

661
00:28:54,709 --> 00:28:56,600
about the Spark architecture.

662
00:28:56,600 --> 00:28:59,397
It follows a master-slave
architecture. Here,

663
00:28:59,397 --> 00:29:02,400
the client submits
the Spark user application code.

664
00:29:02,400 --> 00:29:05,189
When the application code
is submitted, the driver

665
00:29:05,189 --> 00:29:07,200
implicitly converts the user code

666
00:29:07,200 --> 00:29:09,000
that contains Transformations

667
00:29:09,000 --> 00:29:12,700
and actions into a logical
directed acyclic graph called a DAG.

668
00:29:12,700 --> 00:29:14,200
At this stage, it also

669
00:29:14,200 --> 00:29:18,172
performs optimizations such as
pipelining transformations.

670
00:29:18,172 --> 00:29:21,165
Then it converts
the logical graph, called the DAG,

671
00:29:21,165 --> 00:29:23,032
into a physical execution plan

672
00:29:23,032 --> 00:29:24,100
with many stages.

673
00:29:24,100 --> 00:29:26,972
After converting it into
a physical execution plan,

674
00:29:26,972 --> 00:29:30,100
it creates physical
execution units called tasks

675
00:29:30,100 --> 00:29:31,100
under each stage.
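As a rough illustration of how a chain of operations becomes stages and tasks, here is a simplified plain-Python sketch. It is not how Spark's scheduler is implemented: the function names are hypothetical, the plan is treated as a straight line of operation names, and it assumes a new stage starts at each shuffle operation such as reduceByKey, with one task per partition in every stage.

```python
WIDE = {"reduceByKey", "groupByKey", "join"}  # operations that need a shuffle

def split_into_stages(ops):
    """Group a linear chain of operation names into stages, cutting at each shuffle."""
    stages, current = [], []
    for op in ops:
        if op in WIDE and current:
            stages.append(current)   # close the stage that feeds the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

def count_tasks(stages, num_partitions):
    # Simplifying assumption: the partition count is the same in every stage.
    return len(stages) * num_partitions

plan = split_into_stages(
    ["textFile", "flatMap", "map", "reduceByKey", "saveAsTextFile"]
)
total_tasks = count_tasks(plan, 2)
```

For a word-count style job this gives two stages: the narrow operations are pipelined together in the first, and the shuffle result is written out in the second.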

676
00:29:31,200 --> 00:29:33,300
Then these tasks are bundled

677
00:29:33,300 --> 00:29:36,300
and sent to the cluster.
Now the driver talks

678
00:29:36,300 --> 00:29:39,523
to the cluster manager
and negotiates the resources,

679
00:29:39,523 --> 00:29:42,727
and the cluster manager launches
the needed executors.

680
00:29:42,727 --> 00:29:45,392
At this point, the driver
will also send the tasks

681
00:29:45,392 --> 00:29:47,828
to the executors based
on data placement.

682
00:29:47,828 --> 00:29:51,610
When executors start, they register
themselves with the driver,

683
00:29:51,610 --> 00:29:55,147
so that the driver will have
a complete view of the executors,

684
00:29:55,147 --> 00:29:57,815
and executors now start
executing the tasks

685
00:29:57,815 --> 00:30:00,099
that are assigned by
the driver program.

686
00:30:00,099 --> 00:30:01,300
At any point of time

687
00:30:01,300 --> 00:30:04,800
while the application is running,
the driver program will monitor

688
00:30:04,800 --> 00:30:06,000
the set of executors

689
00:30:06,000 --> 00:30:07,848
that run, and the driver node

690
00:30:07,848 --> 00:30:11,100
also schedules future tasks
based on data placement.

691
00:30:11,100 --> 00:30:14,600
So this is how the internal
working takes place in the Spark

692
00:30:14,600 --> 00:30:17,400
architecture. There are
three different types

693
00:30:17,400 --> 00:30:18,968
of workloads that Spark

694
00:30:18,968 --> 00:30:22,282
can cater to. First, batch mode.
In the case of batch mode,

695
00:30:22,282 --> 00:30:24,800
we run a batch job. Here,
you write the job

696
00:30:24,800 --> 00:30:26,100
and then schedule it.

697
00:30:26,100 --> 00:30:28,989
It works through a queue
or batch of separate jobs

698
00:30:28,989 --> 00:30:31,804
without manual intervention.
Next, interactive mode.

699
00:30:31,804 --> 00:30:33,460
This is an interactive shell

700
00:30:33,460 --> 00:30:36,300
where you go and execute
the commands one by one.

701
00:30:36,300 --> 00:30:39,100
So you'll execute
one command, check the result,

702
00:30:39,100 --> 00:30:41,177
and then execute
the other command based

703
00:30:41,177 --> 00:30:42,700
on the output result and so

704
00:30:42,700 --> 00:30:44,600
on. It works similarly to the SQL

705
00:30:44,600 --> 00:30:48,200
shell. So the shell is the one
which executes a driver program.

706
00:30:48,200 --> 00:30:50,833
So it is generally used
for development work

707
00:30:50,833 --> 00:30:53,100
or it is also used
for ad hoc queries.

708
00:30:53,100 --> 00:30:54,670
Then comes the streaming mode,

709
00:30:54,670 --> 00:30:57,200
where the program
is continuously running as

710
00:30:57,200 --> 00:30:59,400
and when the data
comes, it takes the data,

711
00:30:59,500 --> 00:31:02,000
does some transformations
and actions on the data,

712
00:31:02,300 --> 00:31:04,200
and then produces output results.

713
00:31:04,400 --> 00:31:06,900
So these are the three
different types of workloads

714
00:31:06,900 --> 00:31:09,000
that Spark actually caters to. Now,

715
00:31:09,000 --> 00:31:11,866
let's move ahead and see
a simple demo. Here,

716
00:31:11,866 --> 00:31:14,600
let's understand how
to create a Spark

717
00:31:14,600 --> 00:31:17,000
application in the Spark
shell using Scala.

718
00:31:17,000 --> 00:31:18,266
So let's understand

719
00:31:18,266 --> 00:31:21,400
how to create a Spark
application in the Spark shell

720
00:31:21,400 --> 00:31:22,700
using Scala. Assume

721
00:31:22,700 --> 00:31:25,700
that we have a text file
in an HDFS directory,

722
00:31:25,700 --> 00:31:28,900
and we are counting the number
of words in that text file.

723
00:31:28,900 --> 00:31:30,421
So, let's see how to do it.

724
00:31:30,421 --> 00:31:32,900
So before I start running,
let me first check

725
00:31:32,900 --> 00:31:34,900
whether all my daemons
are running or not.

726
00:31:35,200 --> 00:31:37,100
So I'll type sudo jps.

727
00:31:37,200 --> 00:31:40,600
So all my Spark daemons
and Hadoop daemons are running.

728
00:31:40,600 --> 00:31:44,353
I have Master and Worker
as Spark daemons, and NameNode,

729
00:31:44,353 --> 00:31:47,400
ResourceManager, NodeManager, everything
as Hadoop daemons.

730
00:31:47,400 --> 00:31:48,749
So the first thing

731
00:31:48,749 --> 00:31:51,600
that I do here is
I run the spark shell

732
00:31:51,700 --> 00:31:54,700
So it takes a bit of time
to start. In the meanwhile,

733
00:31:54,700 --> 00:31:56,700
let me tell you, the web UI port

734
00:31:56,700 --> 00:31:59,623
for the Spark shell is
localhost:4040.

735
00:32:00,300 --> 00:32:02,900
So this is the web
UI for Spark. Like,

736
00:32:02,900 --> 00:32:06,400
if you click on Jobs, right now
we have not executed anything.

737
00:32:06,400 --> 00:32:08,861
So there are
no details over here.

738
00:32:09,400 --> 00:32:11,900
So there you have job stages.

739
00:32:12,100 --> 00:32:14,200
So once you execute the jobs,

740
00:32:14,200 --> 00:32:16,300
you'll be having
the records of the tasks

741
00:32:16,300 --> 00:32:17,700
that you have executed here.

742
00:32:17,700 --> 00:32:20,400
So here you can see
the stages of various jobs

743
00:32:20,400 --> 00:32:21,706
and tasks executed.

744
00:32:21,706 --> 00:32:22,943
So now let's check

745
00:32:22,943 --> 00:32:25,900
whether our Spark
shell has started or not.

746
00:32:25,900 --> 00:32:26,500
Yes.

747
00:32:26,500 --> 00:32:30,074
So you have your Spark version
as 2.1.1,

748
00:32:30,074 --> 00:32:32,500
and you have a Scala
shell over here.

749
00:32:32,600 --> 00:32:34,300
So before I start the code,

750
00:32:34,300 --> 00:32:36,300
let's check the content
that is present

751
00:32:36,300 --> 00:32:38,600
in the input text file
by running this command.

752
00:32:38,933 --> 00:32:39,933
So I'll write

753
00:32:39,933 --> 00:32:44,000
var test is equal
to sc.textFile,

754
00:32:44,000 --> 00:32:46,700
because I have saved
a text file over there

755
00:32:46,700 --> 00:32:49,300
and I'll give
the HDFS path location.

756
00:32:50,000 --> 00:32:52,900
I've stored my text file
in this location.

757
00:32:53,300 --> 00:32:55,600
And sample is the name
of the text file.

758
00:32:55,600 --> 00:32:58,400
So now let me give
test.collect,

759
00:32:58,400 --> 00:32:59,834
so that it collects the data

760
00:32:59,834 --> 00:33:02,600
and displays the data that
is present in the text file.

761
00:33:02,600 --> 00:33:04,500
So in my text file,

762
00:33:04,500 --> 00:33:08,500
I have Hadoop, research, analyst,
data, science, and science.

763
00:33:08,500 --> 00:33:10,500
So this is my input data.

764
00:33:10,500 --> 00:33:12,200
So now let me map

765
00:33:12,200 --> 00:33:15,600
the functions and apply
the transformations and actions.

766
00:33:15,600 --> 00:33:20,000
So I'll give var map is equal
to sc.textFile,

767
00:33:20,000 --> 00:33:22,600
and I will specify

768
00:33:22,600 --> 00:33:28,800
my path location. So this
is my input path location,

769
00:33:29,073 --> 00:33:32,226
and I'll apply
the flatMap transformation

770
00:33:32,457 --> 00:33:33,842
to split the data,

771
00:33:36,100 --> 00:33:38,100
where the words are separated by spaces,

772
00:33:38,900 --> 00:33:44,330
and then map the word count to
be given as (word, 1). Now,

773
00:33:44,330 --> 00:33:46,100
this would be executed.

774
00:33:46,100 --> 00:33:46,600
Yes.

775
00:33:47,100 --> 00:33:49,000
Now, let me apply the action

776
00:33:49,000 --> 00:33:52,000
for this to start
the execution of the task.

777
00:33:52,900 --> 00:33:56,100
So let me tell you one thing
here: before applying an action,

778
00:33:56,100 --> 00:33:58,600
Spark will not start
the execution process.

779
00:33:58,600 --> 00:34:00,600
So here I have applied
reduceByKey

780
00:34:00,600 --> 00:34:02,800
as the action to start
counting the number

781
00:34:02,800 --> 00:34:04,100
of words in the text file.

782
00:34:04,500 --> 00:34:07,100
So now we are done
with applying Transformations

783
00:34:07,100 --> 00:34:08,300
and actions as well.

784
00:34:08,300 --> 00:34:09,774
So now the next step is

785
00:34:09,774 --> 00:34:13,300
to specify the output location
to store the output file.

786
00:34:13,300 --> 00:34:16,400
So I will give
counts.saveAsTextFile

787
00:34:16,400 --> 00:34:19,500
and then specify
the location for my output file.

788
00:34:19,500 --> 00:34:21,398
I'll store it
in the same location

789
00:34:21,398 --> 00:34:23,000
where I have my input file.

790
00:34:23,700 --> 00:34:28,400
And I'll specify my output
file name as output9. Cool.

791
00:34:29,000 --> 00:34:31,200
I forgot to give
the double quotes.

792
00:34:31,800 --> 00:34:33,200
And I will run this.

793
00:34:36,603 --> 00:34:38,296
So it's completed now.

794
00:34:38,473 --> 00:34:40,626
So now let's see the output.

795
00:34:41,000 --> 00:34:42,900
I will open my Hadoop web UI

796
00:34:42,900 --> 00:34:45,750
by giving localhost:50070

797
00:34:45,750 --> 00:34:48,600
and browse the file system
to check the output.

798
00:34:48,900 --> 00:34:50,284
So as I have said,

799
00:34:50,284 --> 00:34:54,000
I have example as my directory
that I have created,

800
00:34:54,000 --> 00:34:57,600
and in that I have specified
output9 as my output.

801
00:34:57,600 --> 00:35:00,300
So I have two part
files that have been created.

802
00:35:00,300 --> 00:35:02,600
Let's check each
of them one by one.

803
00:35:04,800 --> 00:35:06,512
So we have the data count

804
00:35:06,512 --> 00:35:09,116
as one, analyst count
as one, and science

805
00:35:09,116 --> 00:35:12,200
count as two. So this is
the first part file. Now,

806
00:35:12,200 --> 00:35:14,200
Let me open the second
part file for you.

807
00:35:18,500 --> 00:35:20,800
So this is the second
part file. There you

808
00:35:20,800 --> 00:35:23,800
have Hadoop count as one
and the research count as one.

809
00:35:24,500 --> 00:35:26,558
So now let me show
you the text file

810
00:35:26,558 --> 00:35:28,600
that we have specified
as the input.

811
00:35:30,200 --> 00:35:31,363
So as I have told

812
00:35:31,363 --> 00:35:34,076
you, Hadoop count is
one, research count as

813
00:35:34,076 --> 00:35:37,400
one, analyst one, data one, science
and science as one and one. So

814
00:35:37,400 --> 00:35:39,600
you might be thinking
data science is one word.

815
00:35:39,600 --> 00:35:40,969
No, in the program code,

816
00:35:40,969 --> 00:35:44,600
we have asked to count the words
that are separated by a space.

817
00:35:44,600 --> 00:35:47,600
So that is why we have
science count as two.

818
00:35:47,600 --> 00:35:51,100
I hope you got an idea
about how word count works.
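[Editor's sketch] The counting rule described in this part of the session can be modeled in plain Python, without Spark. This is only an illustration: the two sample lines are an assumption, chosen to reproduce the counts read out above, and are not the actual input file. The helper name `word_count` is hypothetical.

```python
from collections import Counter

def word_count(lines):
    """Count words the way the demo does: split every line on a space."""
    counts = Counter()
    for line in lines:
        # "flatMap" step: one line becomes many words
        for word in line.split(" "):
            if word:  # ignore empty strings produced by repeated spaces
                # "map" to (word, 1) and "reduce by key" in one step
                counts[word] += 1
    return dict(counts)

# Hypothetical input, picked only to match the counts mentioned in the session.
sample_lines = ["hadoop research analyst", "data science science"]
counts = word_count(sample_lines)
print(counts)
```

Because the split is on spaces, "data science" is never treated as a single word, which is why "science" ends up with a count of two.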

819
00:35:51,515 --> 00:35:54,900
Similarly, I will now
parallelize the numbers 1 to 100

820
00:35:54,900 --> 00:35:56,200
and divide the tasks

821
00:35:56,200 --> 00:36:00,100
into five partitions to show
you what partitions of tasks are.

822
00:36:00,100 --> 00:36:04,400
So I will write sc.parallelize
with the numbers 1 to 100

823
00:36:04,403 --> 00:36:07,096
and divide them
into five partitions

824
00:36:07,115 --> 00:36:10,900
and apply collect action
to collect the numbers

825
00:36:10,900 --> 00:36:12,700
and start the execution.

826
00:36:12,784 --> 00:36:16,015
So it displays you
an array of 100 numbers.
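[Editor's sketch] To make the idea of five partitions concrete, here is a plain-Python model of splitting the numbers 1 to 100 into five contiguous chunks and then collecting them again. The function `slice_into_partitions` is a hypothetical helper, and the even contiguous split is an assumption about how such slicing can work; it is not a claim about Spark's exact internals.

```python
def slice_into_partitions(data, num_slices):
    """Split a list into num_slices contiguous chunks of near-equal size."""
    n = len(data)
    return [
        data[(i * n) // num_slices : ((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]

numbers = list(range(1, 101))                  # the numbers 1 to 100
partitions = slice_into_partitions(numbers, 5)

for index, part in enumerate(partitions):
    # show the first value, last value and size of each partition
    print(index, part[0], part[-1], len(part))

# "collect" corresponds to concatenating the partitions back in order.
collected = [x for part in partitions for x in part]
```

Each chunk can be handed to a separate task, which is why five tasks show up for this job in the web UI.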

827
00:36:16,300 --> 00:36:20,900
Now, let me explain the jobs,
stages, partitions, event timeline,

828
00:36:20,900 --> 00:36:23,100
DAG representation
and everything.

829
00:36:23,100 --> 00:36:26,023
So now let me go
to the web UI of spark

830
00:36:26,023 --> 00:36:27,437
and click on jobs.

831
00:36:27,601 --> 00:36:29,294
So these are the tasks

832
00:36:29,294 --> 00:36:33,217
that I have submitted. So,
coming to the word count example:

833
00:36:33,700 --> 00:36:36,300
So this is the
DAG visualization.

834
00:36:36,300 --> 00:36:38,700
I hope you can see
it clearly. First,

835
00:36:38,700 --> 00:36:40,401
you collected the text file,

836
00:36:40,401 --> 00:36:42,709
then you applied
the flatMap transformation

837
00:36:42,709 --> 00:36:45,139
and mapped it to count
the number of words

838
00:36:45,139 --> 00:36:47,333
and then applied
the reduceByKey transformation

839
00:36:47,333 --> 00:36:49,100
and then saved the output file

840
00:36:49,100 --> 00:36:50,500
using saveAsTextFile.

841
00:36:50,500 --> 00:36:52,900
So this is the entire
DAG visualization

842
00:36:52,900 --> 00:36:54,000
of the number of steps

843
00:36:54,000 --> 00:36:56,000
that we have covered
in our program.

844
00:36:56,000 --> 00:36:58,271
So here it shows
the completed stages

845
00:36:58,271 --> 00:37:01,900
that is two stages
and it also shows the duration

846
00:37:01,900 --> 00:37:03,284
that is 2 seconds.

847
00:37:03,400 --> 00:37:05,800
And if you click
on the event timeline,

848
00:37:05,800 --> 00:37:08,482
it just shows the executor
that is added.

849
00:37:08,482 --> 00:37:11,500
And in this case you
cannot see any partitions

850
00:37:11,500 --> 00:37:15,300
because you have not split the
jobs into various partitions.

851
00:37:15,500 --> 00:37:19,200
So this is how you can see
the event timeline and the DAG

852
00:37:19,200 --> 00:37:21,700
visualization. Here,
you can also see

853
00:37:21,700 --> 00:37:24,759
the stage ID, the description,
when you have submitted

854
00:37:24,759 --> 00:37:26,800
that I have just
submitted it now

855
00:37:26,800 --> 00:37:29,294
and in this it also
shows the duration

856
00:37:29,294 --> 00:37:32,800
that it took to execute the task
and the output bytes

857
00:37:32,800 --> 00:37:35,500
that it took, the shuffle
read, shuffle write,

858
00:37:35,500 --> 00:37:39,100
and many more. Now, to show
you the partitions, see:

859
00:37:39,100 --> 00:37:42,500
in this you just applied
sc.parallelize, right?

860
00:37:42,500 --> 00:37:45,151
So it is just showing
one stage where you

861
00:37:45,151 --> 00:37:48,400
have applied the parallelize
operation here.

862
00:37:48,400 --> 00:37:51,300
It shows the succeeded
tasks as five by five.

863
00:37:51,300 --> 00:37:54,700
That is, you have divided
the job into five tasks,

864
00:37:54,700 --> 00:37:58,762
and all five tasks have been
executed successfully. Now, here

865
00:37:58,762 --> 00:38:02,300
you can see the partitions
of the five different tasks

866
00:38:02,300 --> 00:38:04,112
that are executed in parallel.

867
00:38:04,112 --> 00:38:05,800
So depending on the colors,

868
00:38:05,800 --> 00:38:07,500
it shows the scheduler delay

869
00:38:07,500 --> 00:38:10,500
the shuffle read time,
executor computing time, result

870
00:38:10,500 --> 00:38:11,500
serialization time,

871
00:38:11,500 --> 00:38:13,921
and getting result time
and many more

872
00:38:13,921 --> 00:38:15,836
So you can see the duration

873
00:38:15,836 --> 00:38:19,252
that it took to execute
the five tasks in parallel

874
00:38:19,252 --> 00:38:21,263
at the same time is at most

875
00:38:21,263 --> 00:38:22,700
one millisecond.

876
00:38:22,700 --> 00:38:26,200
So, being in memory, Spark has
much faster computation,

877
00:38:26,200 --> 00:38:27,810
and you can see the IDs

878
00:38:27,810 --> 00:38:31,100
of all the five different
tasks; all are successful.

879
00:38:31,100 --> 00:38:33,166
You can see the locality level.

880
00:38:33,166 --> 00:38:37,033
You can see the executor ID and
the host IP, the launch time,

881
00:38:37,033 --> 00:38:39,100
the duration it takes, everything.

882
00:38:39,200 --> 00:38:40,631
so you can also see

883
00:38:40,631 --> 00:38:44,978
that we have created our RDD
and parallelized it. Similarly, here

884
00:38:44,978 --> 00:38:47,000
also for word count example,

885
00:38:47,000 --> 00:38:48,306
you can see the rdd

886
00:38:48,306 --> 00:38:51,324
that has been created
and also the actions

887
00:38:51,324 --> 00:38:53,800
that we have applied
to execute the task

888
00:38:54,000 --> 00:38:57,401
and you can see the duration
that it took even here also,

889
00:38:57,401 --> 00:38:58,980
it's just one millisecond

890
00:38:58,980 --> 00:39:02,200
that it took to execute
the entire word count example,

891
00:39:02,200 --> 00:39:05,900
and you can see the IDs,
locality level, executor ID.

892
00:39:05,900 --> 00:39:06,916
So in this case,

893
00:39:06,916 --> 00:39:09,712
we have just executed
the task in two stages.

894
00:39:09,712 --> 00:39:11,900
So it is just showing
the two stages.

895
00:39:11,900 --> 00:39:13,100
So this is all about

896
00:39:13,100 --> 00:39:16,266
how web UI looks and what are
the features and information

897
00:39:16,266 --> 00:39:18,435
that you can see
in the web UI of spark

898
00:39:18,435 --> 00:39:21,200
after executing the program
in the Scala shell.

899
00:39:21,200 --> 00:39:22,271
So in this program,

900
00:39:22,271 --> 00:39:25,635
you can see that we first gave
the path to the input location

901
00:39:25,635 --> 00:39:26,700
and checked the data

902
00:39:26,700 --> 00:39:29,063
that is present
in the input file.

903
00:39:29,063 --> 00:39:31,900
And then we applied
the flatMap transformation

904
00:39:31,900 --> 00:39:33,100
and created an RDD,

905
00:39:33,100 --> 00:39:36,800
and then applied an action to start
the execution of the task

906
00:39:36,800 --> 00:39:39,500
and saved the output file
in this location.

907
00:39:39,500 --> 00:39:41,643
So I hope you got a clear idea

908
00:39:41,643 --> 00:39:45,054
of how to execute a word
count example and check

909
00:39:45,054 --> 00:39:46,861
for the various features

910
00:39:46,861 --> 00:39:50,700
in the Spark web UI, like
partitions and DAG visualizations,

911
00:39:50,700 --> 00:39:59,900
and I hope you found the session
interesting. Apache Spark:

912
00:40:00,000 --> 00:40:03,900
This word can generate a spark
in every Hadoop engineer's mind.

913
00:40:03,900 --> 00:40:06,188
It is a big data
processing framework,

914
00:40:06,188 --> 00:40:08,805
which is lightning fast
in cluster computing.

915
00:40:08,805 --> 00:40:12,300
And the core reason behind
its outstanding performance is

916
00:40:12,300 --> 00:40:15,500
the resilient distributed
dataset or, in short,

917
00:40:15,500 --> 00:40:17,779
the RDD. And today I'll focus

918
00:40:17,779 --> 00:40:20,200
on the topic called
RDD using Spark.

919
00:40:20,200 --> 00:40:21,723
Before we get started,

920
00:40:21,723 --> 00:40:23,900
Let's have a quick look
at the agenda

921
00:40:23,900 --> 00:40:24,900
for today's session.

922
00:40:25,100 --> 00:40:28,213
We shall start with
understanding the need for rdds

923
00:40:28,213 --> 00:40:29,272
where we'll learn

924
00:40:29,272 --> 00:40:32,200
the reasons why
RDDs were required.

925
00:40:32,200 --> 00:40:34,700
Then we shall learn
what RDDs are,

926
00:40:34,700 --> 00:40:37,871
where we'll understand
what exactly an RDD is

927
00:40:37,871 --> 00:40:39,800
and how they work. Later,

928
00:40:39,800 --> 00:40:42,400
I'll walk you through
the fascinating features

929
00:40:42,400 --> 00:40:46,300
of RDDs such as in-memory
computation, partitioning,

930
00:40:46,374 --> 00:40:48,475
persistence, fault tolerance,

931
00:40:48,475 --> 00:40:49,475
and many more.

932
00:40:49,600 --> 00:40:51,200
Once I finish the theory,

933
00:40:51,300 --> 00:40:53,200
I'll get your hands on rdds

934
00:40:53,200 --> 00:40:55,100
where we'll practically create

935
00:40:55,100 --> 00:40:58,141
and perform all possible
operations on RDDs.

936
00:40:58,141 --> 00:40:59,500
And finally, I'll wind

937
00:40:59,500 --> 00:41:02,677
up this session with
an interesting Pokémon use case,

938
00:41:02,677 --> 00:41:06,100
which will help you understand
rdds in a much better way.

939
00:41:06,100 --> 00:41:08,100
Let's get started. Spark is one

940
00:41:08,100 --> 00:41:10,792
of the top mandatory skills
required by each

941
00:41:10,792 --> 00:41:12,518
and every Big Data developer.

942
00:41:12,518 --> 00:41:14,687
It is used
in multiple applications,

943
00:41:14,687 --> 00:41:17,800
which need real-time processing
such as Google's

944
00:41:17,800 --> 00:41:21,066
recommendation engine, credit
card fraud detection,

945
00:41:21,066 --> 00:41:23,713
and many more. To understand
this in depth,

946
00:41:23,713 --> 00:41:27,200
we shall consider Amazon's
recommendation engine. Assume

947
00:41:27,200 --> 00:41:29,500
that you are searching
for a mobile phone

948
00:41:29,500 --> 00:41:33,126
on Amazon and you have certain
specifications of your choice.

949
00:41:33,126 --> 00:41:36,742
Then the Amazon search engine
understands your requirements

950
00:41:36,742 --> 00:41:38,450
and provides you the products

951
00:41:38,450 --> 00:41:41,155
which match the specifications
of your choice.

952
00:41:41,155 --> 00:41:43,800
All this is made possible
because of the most

953
00:41:43,800 --> 00:41:46,717
powerful tool existing
in Big Data environment,

954
00:41:46,717 --> 00:41:49,000
which is none other
than Apache spark

955
00:41:49,000 --> 00:41:51,000
and the resilient distributed dataset

956
00:41:51,000 --> 00:41:53,946
is considered to be
the heart of Apache Spark.

957
00:41:53,946 --> 00:41:56,735
So with this let's begin
our first question.

958
00:41:56,735 --> 00:41:58,300
Why do we need RDDs?

959
00:41:58,300 --> 00:42:01,410
Well, the current world
is expanding the technology

960
00:42:01,410 --> 00:42:02,903
and artificial intelligence

961
00:42:02,903 --> 00:42:06,891
is the face of this evolution.
The machine learning algorithms

962
00:42:06,891 --> 00:42:09,300
and the data needed
to train these computers

963
00:42:09,300 --> 00:42:10,453
are huge. The logic

964
00:42:10,453 --> 00:42:13,378
behind all these algorithms
is very complicated

965
00:42:13,378 --> 00:42:17,300
and mostly run in a distributed
and iterative computation method

966
00:42:17,300 --> 00:42:19,800
the machine learning
algorithms could not use

967
00:42:19,800 --> 00:42:21,053
the older MapReduce

968
00:42:21,053 --> 00:42:24,500
programs, because the traditional
MapReduce programs needed

969
00:42:24,500 --> 00:42:26,733
a stable state in HDFS, and we know

970
00:42:26,733 --> 00:42:31,200
that hdfs generates redundancy
during intermediate computations

971
00:42:31,200 --> 00:42:34,800
which resulted in a major
latency in data processing

972
00:42:34,800 --> 00:42:36,900
and in hdfs gathering data

973
00:42:36,900 --> 00:42:39,400
for multiple processing units
at a single instance

974
00:42:39,400 --> 00:42:42,752
was time consuming. Along
with this, the major issue

975
00:42:42,752 --> 00:42:46,600
was that HDFS did not have
random read and write ability.

976
00:42:46,600 --> 00:42:49,000
So using these old
mapreduce programs

977
00:42:49,000 --> 00:42:52,000
for machine learning
problems would be Then

978
00:42:52,000 --> 00:42:53,700
Spark was introduced.

979
00:42:53,700 --> 00:42:55,318
Compared to MapReduce, Spark

980
00:42:55,318 --> 00:42:58,435
is an advanced big data
processing framework. The resilient

981
00:42:58,435 --> 00:42:59,503
distributed dataset,

982
00:42:59,503 --> 00:43:02,423
which is a fundamental
and most crucial data structure

983
00:43:02,423 --> 00:43:03,600
of spark was the one

984
00:43:03,600 --> 00:43:06,900
which made it all possible. RDDs
are effortless to create

985
00:43:06,900 --> 00:43:09,205
and the mind-blowing
property which solved

986
00:43:09,205 --> 00:43:12,500
the problem was its in-memory
data processing capability.

987
00:43:12,500 --> 00:43:15,600
An RDD is not a distributed
file system. Instead,

988
00:43:15,600 --> 00:43:17,894
It is a distributed
collection held in memory,

989
00:43:17,894 --> 00:43:19,905
where the data needed
is always stored

990
00:43:19,905 --> 00:43:21,057
and kept available.

991
00:43:21,057 --> 00:43:24,269
in RAM. And because of
this property, the elevation it

992
00:43:24,269 --> 00:43:27,300
gave to the memory
accessing speed was unbelievable.

993
00:43:27,300 --> 00:43:29,250
RDDs are fault tolerant,

994
00:43:29,250 --> 00:43:32,900
and this property brought them
a dignity of a whole new level.

995
00:43:32,900 --> 00:43:35,074
So our next question would be

996
00:43:35,074 --> 00:43:38,522
what are RDDs? The resilient
distributed data sets

997
00:43:38,522 --> 00:43:39,600
or the rdds are

998
00:43:39,600 --> 00:43:42,600
the primary underlying
data structures of spark.

999
00:43:42,600 --> 00:43:44,311
They are highly fault tolerant

1000
00:43:44,311 --> 00:43:46,900
and they store data
amongst multiple computers

1001
00:43:46,900 --> 00:43:51,000
in a network. The data is written
into multiple executor nodes,

1002
00:43:51,000 --> 00:43:54,800
so that in case of a calamity,
if any executing node fails,

1003
00:43:54,800 --> 00:43:57,459
then within a fraction
of a second it gets backed up

1004
00:43:57,459 --> 00:43:59,100
from the next executor node

1005
00:43:59,100 --> 00:44:02,200
with the same processing speeds
of the current node,

1006
00:44:02,300 --> 00:44:04,900
the fault-tolerant property
enables them to roll back

1007
00:44:04,900 --> 00:44:06,876
their data to the original state

1008
00:44:06,876 --> 00:44:09,038
by applying simple
Transformations on

1009
00:44:09,038 --> 00:44:11,225
to the lost part
in the lineage. RDDs

1010
00:44:11,225 --> 00:44:13,696
do not need
anything like a hard disk

1011
00:44:13,696 --> 00:44:15,489
or any other secondary storage

1012
00:44:15,489 --> 00:44:17,700
all that they need
is the main memory,

1013
00:44:17,700 --> 00:44:18,700
which is RAM. Now

1014
00:44:18,700 --> 00:44:21,100
that we have understood
the need for RDDs

1015
00:44:21,100 --> 00:44:22,482
and what exactly

1016
00:44:22,482 --> 00:44:25,204
an RDD is, let us see
the different sources

1017
00:44:25,204 --> 00:44:28,223
from which the data
can be ingested into an rdd.

1018
00:44:28,223 --> 00:44:30,600
The data can be loaded
from any Source

1019
00:44:30,600 --> 00:44:33,700
like HDFS, HBase, Hive, SQL;

1020
00:44:33,700 --> 00:44:34,658
you name it,

1021
00:44:34,658 --> 00:44:35,582
they've got it.

1022
00:44:35,700 --> 00:44:36,200
Hence.

1023
00:44:36,200 --> 00:44:39,000
The collected data
is dropped into an rdd.

1024
00:44:39,000 --> 00:44:42,000
And guess what? RDDs
are free-spirited: they

1025
00:44:42,000 --> 00:44:44,051
can process any type of data.

1026
00:44:44,051 --> 00:44:47,800
They won't care if the data
is structured, unstructured,

1027
00:44:47,800 --> 00:44:49,500
or semi-structured. Now,

1028
00:44:49,500 --> 00:44:51,200
let me walk you
through the features

1029
00:44:51,200 --> 00:44:52,300
of RDDs,

1030
00:44:52,300 --> 00:44:54,700
which give them an edge
over the other alternatives.

1031
00:44:54,900 --> 00:44:57,100
In-memory computation: the idea

1032
00:44:57,100 --> 00:45:00,632
of in-memory computation brought
groundbreaking progress

1033
00:45:00,632 --> 00:45:03,800
in cluster computing. It
increases the processing speed

1034
00:45:03,800 --> 00:45:07,877
when compared with HDFS.
Moving on to lazy evaluation:

1035
00:45:07,877 --> 00:45:08,827
the word lazy

1036
00:45:08,827 --> 00:45:09,527
explains it

1037
00:45:09,527 --> 00:45:12,564
all. Spark logs all
the transformations you apply

1038
00:45:12,564 --> 00:45:16,056
onto it and will not throw
any output onto the display

1039
00:45:16,056 --> 00:45:17,900
until an action is invoked.
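[Editor's sketch] Lazy evaluation as described here can be modeled in a few lines of plain Python. The class `LazyDataset` is a hypothetical stand-in for an RDD, not Spark's API: transformations are only recorded, and the recorded steps are replayed when an action (`collect`) is called.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them on collect()."""

    def __init__(self, data, lineage=()):
        self._data = list(data)
        self._lineage = tuple(lineage)  # recorded (kind, function) steps

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return LazyDataset(self._data, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Also only recorded, not executed.
        return LazyDataset(self._data, self._lineage + (("filter", fn),))

    def collect(self):
        # The action: replay the recorded lineage over the source data.
        result = self._data
        for kind, fn in self._lineage:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


calls = []

def double(x):
    calls.append(x)  # lets us observe when the work actually happens
    return x * 2

pipeline = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: only the plan has been recorded
output = pipeline.collect()        # now the recorded steps actually run
print(calls_before_action, output)
```

The same recorded list of steps is what makes recovery possible in this model: a lost result can be rebuilt by replaying the lineage over the source data.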

1040
00:45:17,900 --> 00:45:22,200
Next is fault tolerance. RDDs
are absolutely fault-tolerant.

1041
00:45:22,200 --> 00:45:26,008
Any lost partition of an rdd
can be rolled back by applying

1042
00:45:26,008 --> 00:45:28,700
simple transformations
to the lost part

1043
00:45:28,700 --> 00:45:30,286
in the lineage. Speaking

1044
00:45:30,286 --> 00:45:34,700
about immutability: the data, once
dropped into an RDD, is immutable

1045
00:45:34,700 --> 00:45:38,016
because the access provided
by an RDD is just read-only.

1046
00:45:38,016 --> 00:45:39,920
The only way to access

1047
00:45:39,920 --> 00:45:43,800
or modify it is by applying
a transformation on to an rdd

1048
00:45:43,800 --> 00:45:45,400
which is prior
to the present one.

1049
00:45:45,400 --> 00:45:47,200
Discussing partitioning:

1050
00:45:47,200 --> 00:45:48,923
The important reason for Spark's

1051
00:45:48,923 --> 00:45:51,100
parallel processing is
its partitioning.

1052
00:45:51,300 --> 00:45:54,163
By default, Spark determines
the number of parts

1053
00:45:54,163 --> 00:45:56,200
into which your data is divided,

1054
00:45:56,200 --> 00:45:59,652
but you can override this
and decide the number of blocks

1055
00:45:59,652 --> 00:46:01,200
you want to split your data into.

1056
00:46:01,200 --> 00:46:03,193
Let's see what persistence is.

1057
00:46:03,193 --> 00:46:05,600
Spark's RDDs are
totally reusable.

1058
00:46:05,600 --> 00:46:06,757
The users can apply

1059
00:46:06,757 --> 00:46:09,502
a certain number of
transformations onto an RDD

1060
00:46:09,502 --> 00:46:11,302
and preserve the final RDD

1061
00:46:11,302 --> 00:46:14,383
for future use. This avoids
the hectic process

1062
00:46:14,383 --> 00:46:17,369
of applying all
the Transformations from scratch

1063
00:46:17,369 --> 00:46:20,867
And now, last but not the least:
coarse-grained operations.

1064
00:46:20,867 --> 00:46:24,300
The operations performed
on rdds using Transformations

1065
00:46:24,300 --> 00:46:28,069
like map, filter, flatMap,
etc. change the RDDs

1066
00:46:28,069 --> 00:46:29,300
and update them.

1067
00:46:29,300 --> 00:46:29,686
Hence.

1068
00:46:29,686 --> 00:46:33,100
Every operation applied
onto an RDD is coarse-grained.

1069
00:46:33,100 --> 00:46:36,800
These are the features of rdds
and moving on to the next stage,

1070
00:46:36,800 --> 00:46:37,800
we shall understand

1071
00:46:37,800 --> 00:46:39,700
the creation of RDDs. RDDs

1072
00:46:39,700 --> 00:46:42,500
can be created
using three methods.

1073
00:46:42,500 --> 00:46:46,000
The first method is using
parallelized collections.

1074
00:46:46,000 --> 00:46:50,400
Next method is by using external
storage like HDFS, HBase,

1075
00:46:50,400 --> 00:46:51,100
Hive,

1076
00:46:51,100 --> 00:46:54,700
and many more. The third one
is using an existing RDD,

1077
00:46:54,700 --> 00:46:56,800
which is prior
to the present one.

1078
00:46:56,800 --> 00:46:58,800
Now, let us understand

1079
00:46:58,800 --> 00:47:02,300
and create an RDD
through each method. Now,

1080
00:47:02,300 --> 00:47:05,600
Spark can be run on virtual
machines like a Spark VM,

1081
00:47:05,600 --> 00:47:08,300
or you can install
a Linux operating system

1082
00:47:08,300 --> 00:47:10,774
like Ubuntu and
run it Standalone,

1083
00:47:10,774 --> 00:47:14,600
but we here at Edureka use
the best-in-class cloud lab

1084
00:47:14,600 --> 00:47:16,900
which comprises
all the frameworks

1085
00:47:16,900 --> 00:47:19,400
you need in a
one-stop cloud framework.

1086
00:47:19,400 --> 00:47:20,776
No need for any hectic

1087
00:47:20,776 --> 00:47:22,323
hassle of downloading any file

1088
00:47:22,323 --> 00:47:24,632
or setting up
environment variables

1089
00:47:24,632 --> 00:47:27,289
and looking for
a hardware specification, etc.

1090
00:47:27,289 --> 00:47:28,890
All you need is a login ID

1091
00:47:28,890 --> 00:47:32,091
and password to the all-in-one
ready to use cloud lab

1092
00:47:32,091 --> 00:47:34,800
where you can run
and save all your programs.

1093
00:47:35,400 --> 00:47:39,600
Let us fire up our Spark shell
using the command spark2-shell.

1094
00:47:39,600 --> 00:47:42,446
Now that the Spark shell
has been fired up,

1095
00:47:42,446 --> 00:47:44,215
let's create a new RDD.

1096
00:47:44,800 --> 00:47:48,400
So here we are creating
a new RDD with the first method,

1097
00:47:48,400 --> 00:47:51,500
which is using the
parallelized collections.

1098
00:47:51,500 --> 00:47:52,954
Here we are creating a new RDD

1099
00:47:52,954 --> 00:47:55,800
by the name parallelized
collections RDD.

1100
00:47:55,800 --> 00:47:57,705
We are starting a spark context

1101
00:47:57,705 --> 00:48:00,321
and we are parallelizing
an array into the RDD,

1102
00:48:00,321 --> 00:48:03,300
which consists of the data
of the days of a week,

1103
00:48:03,300 --> 00:48:04,875
which is Monday Tuesday,

1104
00:48:04,875 --> 00:48:07,500
Wednesday, Thursday,
Friday and Saturday.

1105
00:48:07,500 --> 00:48:10,600
Now, let's create
this new RDD.

1106
00:48:10,600 --> 00:48:13,841
The parallelized collections RDD
is successfully created. Now,

1107
00:48:13,841 --> 00:48:16,900
let's display the data
which is present in our RDD.

1108
00:48:19,400 --> 00:48:23,630
So this was the data
which is present in our RDD. Now,

1109
00:48:23,630 --> 00:48:27,038
let's create a new RDD
using the second method.

1110
00:48:28,200 --> 00:48:30,892
The second method
of creating an rdd

1111
00:48:30,892 --> 00:48:35,400
was using an external storage
such as HDFS, Hive, SQL,

1112
00:48:35,600 --> 00:48:37,100
and many more. Here,

1113
00:48:37,100 --> 00:48:40,200
I'm creating a new rdd
by the name spark file

1114
00:48:40,200 --> 00:48:43,312
where I'll be loading
a text document into the rdd

1115
00:48:43,312 --> 00:48:44,900
from an external storage,

1116
00:48:44,900 --> 00:48:45,900
which is hdfs.

1117
00:48:45,900 --> 00:48:49,700
And this is the location
where my text file is located.

1118
00:48:49,800 --> 00:48:53,600
So the new rdd spark file
is successfully created. Now,

1119
00:48:53,600 --> 00:48:55,054
let's display the data

1120
00:48:55,054 --> 00:48:57,500
which is present
in the spark file RDD.

1121
00:48:58,700 --> 00:48:59,620
So the data

1122
00:48:59,620 --> 00:49:02,241
which is present in
the spark file RDD is

1123
00:49:02,241 --> 00:49:05,500
a collection of letters
starting from A to Z.

1124
00:49:05,500 --> 00:49:05,900
Now.

1125
00:49:05,900 --> 00:49:08,851
Let's create a new RDD
using the third method

1126
00:49:08,851 --> 00:49:10,946
which is using
an existing RDD,

1127
00:49:10,946 --> 00:49:14,201
which is prior to the present
one. In the third method,

1128
00:49:14,201 --> 00:49:16,900
I'm creating a new RDD
by the name words, and

1129
00:49:16,900 --> 00:49:18,700
I'm creating a spark context

1130
00:49:18,700 --> 00:49:21,803
and parallelizing a statement
into the RDD words,

1131
00:49:21,803 --> 00:49:24,700
which is "Spark is
a very powerful language."

1132
00:49:24,800 --> 00:49:26,517
So this is
a collection of words,

1133
00:49:26,517 --> 00:49:28,400
which I have passed
into the new

1134
00:49:28,400 --> 00:49:29,400
RDD words.

1135
00:49:29,400 --> 00:49:29,900
Now.

1136
00:49:29,900 --> 00:49:31,700
Let us apply a transformation

1137
00:49:31,700 --> 00:49:34,800
onto the RDD and create
a new RDD through that.

1138
00:49:35,100 --> 00:49:37,656
So here I'm applying
map transformation

1139
00:49:37,656 --> 00:49:39,140
on to the previous rdd

1140
00:49:39,140 --> 00:49:42,717
that is words and I'm storing
the data into the new RDD,

1141
00:49:42,717 --> 00:49:44,000
which is word pair.

1142
00:49:44,000 --> 00:49:46,500
So here we are applying
map transformation in order

1143
00:49:46,500 --> 00:49:49,645
to display the first letter
of each and every word

1144
00:49:49,645 --> 00:49:51,700
which is stored
in the RDD words.

1145
00:49:51,700 --> 00:49:53,200
Now, let's continue.

1146
00:49:53,200 --> 00:49:56,093
The transformation has been
applied successfully. Now,

1147
00:49:56,093 --> 00:49:59,300
let's display the contents
which are present in the new RDD,

1148
00:49:59,300 --> 00:50:01,800
which is word pair. So,

1149
00:50:01,800 --> 00:50:05,100
as explained, we have displayed
the starting letter of each

1150
00:50:05,100 --> 00:50:06,100
and every word

1151
00:50:06,100 --> 00:50:10,888
as S is the starting letter of Spark,
I is the starting letter of is, and

1152
00:50:10,888 --> 00:50:13,700
so on; L is the starting
letter of language.
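[Editor's sketch] The first-letter example can be reproduced with an ordinary Python list standing in for the RDD. The sentence is the one used in the session; the variable names mirror the spoken names and are otherwise an assumption, and the list comprehension only imitates what a map transformation does.

```python
sentence = "Spark is a very powerful language"

# The "parent" collection: one element per word.
words = sentence.split(" ")

# The map step: build a NEW collection holding the first letter of each word.
# The parent list is left untouched, mirroring the immutability of RDDs.
word_pair = [word[0] for word in words]

print(word_pair)
```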

1153
00:50:13,900 --> 00:50:17,000
Now that we have understood
the creation of RDDs,

1154
00:50:17,000 --> 00:50:17,823
Let us move on

1155
00:50:17,823 --> 00:50:21,000
to the next stage where we'll
understand the operations

1156
00:50:21,000 --> 00:50:23,716
that are performed
on RDDs. Transformations

1157
00:50:23,716 --> 00:50:26,300
and actions are
the two major operations

1158
00:50:26,300 --> 00:50:27,700
that are performed on RDDs.

1159
00:50:27,700 --> 00:50:31,677
Let us understand what
transformations are. We apply

1160
00:50:31,677 --> 00:50:35,575
transformations in order to access,
filter, and modify the data

1161
00:50:35,575 --> 00:50:37,470
which is present in an rdd.

1162
00:50:37,470 --> 00:50:41,087
Now Transformations are further
divided into two types

1163
00:50:41,087 --> 00:50:44,500
narrow transformations and
wide transformations. Now,

1164
00:50:44,500 --> 00:50:47,500
let us understand what
narrow transformations are.

1165
00:50:47,500 --> 00:50:50,200
We apply narrow transformations
onto a single partition

1166
00:50:50,200 --> 00:50:51,400
of the parent RDD,

1167
00:50:51,400 --> 00:50:54,886
because the data required
to process the RDD is available

1168
00:50:54,886 --> 00:50:56,200
on a single partition

1169
00:50:56,200 --> 00:50:58,200
of the parent RDD. The examples

1170
00:50:58,200 --> 00:51:01,125
for narrow transformations
are map, filter,

1171
00:51:01,500 --> 00:51:04,300
flatMap, partition,
and mapPartitions.
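[Editor's sketch] The point about narrow transformations, that each output partition needs only one parent partition, can be shown with nested Python lists. The partition contents and the helper `map_partition` are hypothetical; this models the idea and is not Spark code.

```python
def map_partition(partition, fn):
    """Apply fn to every element of ONE partition, independently of the others."""
    return [fn(x) for x in partition]

# Hypothetical parent dataset already split into three partitions.
parent_partitions = [[1, 2, 3], [4, 5], [6]]

# Narrow transformation: child partition i is built from parent partition i only,
# so no data has to move between partitions.
child_partitions = [map_partition(p, lambda x: x * 10) for p in parent_partitions]

print(child_partitions)
```

A wide transformation, by contrast, would have to gather elements from several parent partitions before producing one output partition.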

1172
00:51:04,400 --> 00:51:06,940
Let us move on to the next
type of Transformations

1173
00:51:06,940 --> 00:51:08,511
which is wide transformations.

1174
00:51:08,511 --> 00:51:11,600
We apply wide transformations
on to the multiple partitions

1175
00:51:11,600 --> 00:51:12,698
of the parent RDD,

1176
00:51:12,698 --> 00:51:16,080
because the data required
to process an rdd is available

1177
00:51:16,080 --> 00:51:17,514
on multiple partitions

1178
00:51:17,514 --> 00:51:19,600
of the parent
RDD. The examples

1179
00:51:19,600 --> 00:51:23,000
for wide transformations
are reduceByKey and union. Now,

1180
00:51:23,000 --> 00:51:24,823
let us move on to the next part

1181
00:51:24,823 --> 00:51:27,200
which is actions. Actions,
on the other hand,

1182
00:51:27,200 --> 00:51:29,802
are considered to be
the next part of operations,

1183
00:51:29,802 --> 00:51:31,700
which are used
to display the final result.

1184
00:51:32,200 --> 00:51:35,793
The examples for actions
are collect, count, take,

1185
00:51:35,800 --> 00:51:38,479
and first. Till now,
we have discussed

1186
00:51:38,479 --> 00:51:40,700
about the theory part on rdd.

1187
00:51:40,700 --> 00:51:42,870
Let us start
executing the operations

1188
00:51:42,870 --> 00:51:44,800
that are performed on RDDs.

1189
00:51:46,500 --> 00:51:49,100
In the practical part,
we'll be dealing with an example

1190
00:51:49,100 --> 00:51:50,600
of IPL match data.

1191
00:51:50,900 --> 00:51:52,900
So here I have a CSV file

1192
00:51:52,900 --> 00:51:57,158
which has the IPL match records
and this CSV file is stored

1193
00:51:57,158 --> 00:51:59,081
in my HDFS, and I'm loading

1194
00:51:59,081 --> 00:52:01,956
my matches.csv file
into the new RDD,

1195
00:52:01,956 --> 00:52:04,200
which is CK file as a text file.

1196
00:52:04,200 --> 00:52:07,909
So the matches.csv file
has been successfully loaded

1197
00:52:07,909 --> 00:52:09,990
as a text file into the new RDD,

1198
00:52:09,990 --> 00:52:11,400
which is CK file. Now,

1199
00:52:11,400 --> 00:52:13,759
let us display the data
which is present

1200
00:52:13,759 --> 00:52:16,300
in our CK file
using an action command.

1201
00:52:16,400 --> 00:52:18,170
So collect is the action command

1202
00:52:18,170 --> 00:52:20,700
which I'm using in order
to display the data

1203
00:52:20,700 --> 00:52:23,100
which is present
in my CK file RDD.

1204
00:52:23,600 --> 00:52:27,569
So here we have in total
six hundred and thirty six rows

1205
00:52:27,569 --> 00:52:30,600
of data which consists
of IPL match records

1206
00:52:30,600 --> 00:52:33,500
from the year 2008 to 2017.

1207
00:52:33,711 --> 00:52:36,788
Now, let us see the schema
of a CSV file.

1208
00:52:37,300 --> 00:52:40,561
I am using the action command
first in order to display

1209
00:52:40,561 --> 00:52:42,800
the schema of the
matches.csv file.

1210
00:52:42,800 --> 00:52:45,300
So this command will display
the starting line

1211
00:52:45,300 --> 00:52:46,400
of the CSV file

1212
00:52:46,400 --> 00:52:48,005
we have. So the schema

1213
00:52:48,005 --> 00:52:51,600
of the CSV file is the ID
of the match, season, city

1214
00:52:51,600 --> 00:52:54,386
where the IPL match
was conducted, date

1215
00:52:54,386 --> 00:52:57,700
of the match, team one, team
two, and so on. Now,

1216
00:52:57,700 --> 00:53:01,100
let's perform the further
operations on a CSV file.

1217
00:53:02,000 --> 00:53:04,300
Now moving on
to the further operations.

1218
00:53:04,300 --> 00:53:07,800
I'm about to split
the second column of my CSV file

1219
00:53:07,800 --> 00:53:10,787
which contains the information
regarding the states

1220
00:53:10,787 --> 00:53:12,700
which conducted the IPL matches.

1221
00:53:12,700 --> 00:53:15,467
So I am using this operation
in order to display

1222
00:53:15,467 --> 00:53:18,000
the states where
the matches were conducted.

1223
00:53:18,700 --> 00:53:21,600
So the transformation
has been successfully applied

1224
00:53:21,600 --> 00:53:24,600
and the data has been stored
into the new RDD, which is states.

1225
00:53:24,600 --> 00:53:26,700
Now, let's display the data
which is stored

1226
00:53:26,700 --> 00:53:30,100
in our states RDD using
the collect action command.

1227
00:53:30,400 --> 00:53:31,890
So these are the states

1228
00:53:31,890 --> 00:53:34,500
where the matches
were conducted. Now,

1229
00:53:34,500 --> 00:53:35,817
let's find out the city

1230
00:53:35,817 --> 00:53:38,700
which conducted the maximum
number of IPL matches.

1231
00:53:39,400 --> 00:53:41,700
Here, I'm creating
a new RDD again,

1232
00:53:41,700 --> 00:53:45,017
which is states count,
and I'm using map transformation

1233
00:53:45,017 --> 00:53:47,799
and I am counting each
and every city and the number

1234
00:53:47,799 --> 00:53:50,200
of matches conducted
in that particular City.

1235
00:53:50,500 --> 00:53:52,776
The transformation
is successfully applied

1236
00:53:52,776 --> 00:53:55,600
and the data has been stored
into the states count RDD.

1237
00:53:56,400 --> 00:53:56,900
Now.

1238
00:53:56,900 --> 00:54:00,097
Let us create a new RDD
by name state count M

1239
00:54:00,097 --> 00:54:01,414
and apply the reduceByKey

1240
00:54:01,414 --> 00:54:04,572
transformation and the map
transformation together,

1241
00:54:04,572 --> 00:54:07,900
and consider tuple one as
the city name and tuple

1242
00:54:07,900 --> 00:54:09,500
two as the number of matches

1243
00:54:09,500 --> 00:54:11,876
which were conducted
in that particular City

1244
00:54:11,876 --> 00:54:12,701
and apply the sortByKey

1245
00:54:12,701 --> 00:54:15,000
transformation
to find out the city

1246
00:54:15,000 --> 00:54:17,700
which conducted the maximum number
of IPL matches.

1247
00:54:17,900 --> 00:54:20,317
The Transformations
are successfully applied

1248
00:54:20,317 --> 00:54:23,200
and the data has been stored
into the state count

1249
00:54:23,200 --> 00:54:25,200
M RDD. Now let's
display the data

1250
00:54:25,200 --> 00:54:26,800
which is present in the state count

1251
00:54:26,800 --> 00:54:29,600
M RDD. Here I am using

1252
00:54:29,600 --> 00:54:33,320
the take action command in order
to take the top 10 results

1253
00:54:33,320 --> 00:54:35,800
which are stored
in the state count M RDD.

1254
00:54:36,100 --> 00:54:38,600
So according to the results
we have Mumbai

1255
00:54:38,600 --> 00:54:41,300
which hosted the maximum number
of IPL matches,

1256
00:54:41,300 --> 00:54:45,700
which is 85, from the year
2008 to the year 2017.
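The counting steps just narrated (map, reduceByKey, sortByKey, take) can be sketched in plain Python. This is not the Spark API and not the code run in the session: the `reduce_by_key` helper, the sample rows, and the swap of each pair before sorting are illustrative assumptions that mirror what the corresponding RDD operations compute.

```python
def reduce_by_key(pairs, func):
    """Combine the values that share a key, as an RDD reduceByKey would."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


# Made-up rows (not from the course file): id, season, city, date, team1, team2.
lines = [
    "1,2017,Hyderabad,2017-04-05,Team A,Team B",
    "2,2017,Mumbai,2017-04-06,Team C,Team D",
    "3,2016,Mumbai,2016-04-09,Team C,Team A",
    "4,2016,Delhi,2016-04-10,Team E,Team B",
    "5,2015,Mumbai,2015-04-11,Team C,Team E",
]

cities = [line.split(",")[2] for line in lines]              # map: keep the city column
city_ones = [(city, 1) for city in cities]                   # map: pair each city with 1
city_counts = reduce_by_key(city_ones, lambda a, b: a + b)   # reduceByKey: add the ones
count_city = [(count, city) for city, count in city_counts]  # map: make the count the key
ranked = sorted(count_city, key=lambda pair: pair[0], reverse=True)  # sortByKey, descending
top = ranked[:10]                                            # take(10)

print(top)
```

Swapping each pair so that the count becomes the key is what lets a sort by key rank the cities by number of matches.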

1257
00:54:46,400 --> 00:54:50,300
Now let us create a new RDD
by name fil RDD and use

1258
00:54:50,300 --> 00:54:53,144
flat map in order to filter
out the match data

1259
00:54:53,144 --> 00:54:55,800
which were conducted
in the city Hyderabad,

1260
00:54:55,800 --> 00:54:58,500
and store the same data
into the fil RDD.

1261
00:54:58,500 --> 00:55:01,617
Since the transformation has been
successfully applied, now

1262
00:55:01,617 --> 00:55:04,600
let us display the data
which is present in our fil RDD,

1263
00:55:04,600 --> 00:55:06,161
which consists of the matches

1264
00:55:06,161 --> 00:55:08,800
which were conducted
excluding the city Hyderabad.

1265
00:55:09,900 --> 00:55:11,126
So this is the data

1266
00:55:11,126 --> 00:55:15,000
which is present in our fil RDD,
which excludes the matches

1267
00:55:15,000 --> 00:55:18,000
which were played
in the city Hyderabad. Now,

1268
00:55:18,000 --> 00:55:19,768
let us create another rdd

1269
00:55:19,768 --> 00:55:22,773
by name fil and store
the data of the matches

1270
00:55:22,773 --> 00:55:25,300
which were conducted
in the year 2017.

1271
00:55:25,300 --> 00:55:27,394
We shall use
filter transformation

1272
00:55:27,394 --> 00:55:28,600
for this operation.

1273
00:55:28,700 --> 00:55:31,000
The transformation is
been applied successfully

1274
00:55:31,000 --> 00:55:34,100
and the data has been stored
into the fil RDD. Now,

1275
00:55:34,100 --> 00:55:36,600
let us display the data
which is present there.

1276
00:55:37,200 --> 00:55:38,588
We shall use the collect

1277
00:55:38,588 --> 00:55:42,545
action command and now we have
the data of all the matches

1278
00:55:42,545 --> 00:55:45,600
which were played specifically
in the year 2017.

1279
00:55:47,100 --> 00:55:49,400
Similarly, we can find
out the matches

1280
00:55:49,400 --> 00:55:52,000
which were played
in the year 2016 and we

1281
00:55:52,000 --> 00:55:54,600
can save the same data
into the new rdd

1282
00:55:54,600 --> 00:55:57,500
which is fil2. Similarly,

1283
00:55:57,500 --> 00:55:59,823
we can find out the data
of the matches

1284
00:55:59,823 --> 00:56:03,100
which were conducted in the year
2016 and we can store

1285
00:56:03,100 --> 00:56:05,061
the same data into our new rdd

1286
00:56:05,061 --> 00:56:08,200
which is fil2. I
have used filter transformation

1287
00:56:08,200 --> 00:56:10,800
in order to filter out
the data of the matches

1288
00:56:10,800 --> 00:56:13,581
which were conducted
in the year 2016 and I

1289
00:56:13,581 --> 00:56:15,900
have saved the data
into the new RDD,

1290
00:56:15,900 --> 00:56:18,300
which is fil2. Now,

1291
00:56:18,300 --> 00:56:20,889
let us understand
the union transformation.

1292
00:56:20,889 --> 00:56:21,900
We will apply

1293
00:56:21,900 --> 00:56:26,400
the union transformation on
to the fil RDD and fil2 RDD

1294
00:56:26,400 --> 00:56:29,100
in order to combine
the data present

1295
00:56:29,100 --> 00:56:30,816
in both the RDDs. Here,

1296
00:56:30,816 --> 00:56:32,232
I'm creating a new rdd

1297
00:56:32,232 --> 00:56:35,931
by the name union RDD, and I'm
applying the union transformation

1298
00:56:35,931 --> 00:56:38,600
on the two RDDs
that we created before.

1299
00:56:38,600 --> 00:56:42,400
The first one is fil RDD,
which consists of the data

1300
00:56:42,400 --> 00:56:44,818
of the matches played
in the year 2017.

1301
00:56:44,818 --> 00:56:46,633
And the second one is fil2,

1302
00:56:46,633 --> 00:56:49,295
which consists of
the data of the matches

1303
00:56:49,295 --> 00:56:52,469
which were played in the year
2016. Here I'll be clubbing

1304
00:56:52,469 --> 00:56:53,921
both the RDDs together

1305
00:56:53,921 --> 00:56:56,700
and I'll be saving the data
into the new RDD,

1306
00:56:56,701 --> 00:56:58,163
which is union RDD.

1307
00:56:58,600 --> 00:57:02,600
Now let us display the data
which is present in the new RDD,

1308
00:57:02,600 --> 00:57:04,100
which is union RDD.

1309
00:57:04,100 --> 00:57:06,100
I am using collect
action command in order

1310
00:57:06,100 --> 00:57:07,100
to display the data.

1311
00:57:07,300 --> 00:57:09,800
So here we have the data
of the matches

1312
00:57:09,800 --> 00:57:11,400
which were played in the years

1313
00:57:11,400 --> 00:57:13,400
2016 and 2017.
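The two filters and the union described above can be sketched in plain Python. The sample rows and the column-based year check are assumptions for illustration (the session may have matched on the line text instead); list concatenation stands in for an RDD union, which likewise keeps duplicates rather than removing them.

```python
# Made-up rows (not from the course file): id, season, city, date, team1, team2.
lines = [
    "1,2017,Hyderabad,2017-04-05,Team A,Team B",
    "2,2017,Mumbai,2017-04-06,Team C,Team D",
    "3,2016,Mumbai,2016-04-09,Team C,Team A",
    "4,2016,Delhi,2016-04-10,Team E,Team B",
    "5,2015,Mumbai,2015-04-11,Team C,Team E",
]


def season(line):
    """Return the season column of one CSV row."""
    return line.split(",")[1]


fil = [line for line in lines if season(line) == "2017"]   # filter: 2017 matches
fil2 = [line for line in lines if season(line) == "2016"]  # filter: 2016 matches
union_rdd = fil + fil2                                     # union: rows of both, duplicates kept

for row in union_rdd:
    print(row)
```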

1314
00:57:13,900 --> 00:57:16,306
And now let's continue
with our operations

1315
00:57:16,306 --> 00:57:19,188
and find out the player
with the maximum number of Man

1316
00:57:19,188 --> 00:57:21,603
of the Match awards.
For this operation,

1317
00:57:21,603 --> 00:57:23,293
I am applying map transformation

1318
00:57:23,293 --> 00:57:25,345
and splitting out
the column number 13,

1319
00:57:25,345 --> 00:57:28,314
which consists of the data
of the players who won the man

1320
00:57:28,314 --> 00:57:30,800
of the match awards
for that particular match.

1321
00:57:30,800 --> 00:57:33,252
So the transformation
has been successfully applied

1322
00:57:33,252 --> 00:57:35,752
and the column number
13 has been successfully split,

1323
00:57:35,752 --> 00:57:37,700
and the data has been
stored into the man

1324
00:57:37,700 --> 00:57:39,238
of the match RDD. Now,

1325
00:57:39,238 --> 00:57:42,155
we are creating a new RDD
by the name man

1326
00:57:42,155 --> 00:57:45,600
of the match count, applying
the map transformation on

1327
00:57:45,600 --> 00:57:46,800
to the previous RDD,

1328
00:57:46,800 --> 00:57:48,300
and we are counting the number

1329
00:57:48,300 --> 00:57:51,300
of awards won by each and
every particular player.

1330
00:57:51,700 --> 00:57:55,733
Now, we shall create a new RDD
by the name man of the match,

1331
00:57:55,733 --> 00:57:59,500
and we are applying reduceByKey
on the previous RDD,

1332
00:57:59,500 --> 00:58:01,311
which is man of the match count.

1333
00:58:01,311 --> 00:58:03,765
And again, we are applying
map transformation

1334
00:58:03,765 --> 00:58:06,600
and considering tuple one
as the name of the player

1335
00:58:06,600 --> 00:58:08,843
and tuple two as
the number of matches

1336
00:58:08,843 --> 00:58:11,500
in which he won the Man
of the Match award.

1337
00:58:11,500 --> 00:58:14,794
Let us use the take action command
in order to print the data

1338
00:58:14,794 --> 00:58:18,000
which is stored in our new RDD,
which is man of the match.

1339
00:58:18,200 --> 00:58:21,400
So according to the result
we have a bws

1340
00:58:21,400 --> 00:58:24,000
who won the maximum number
of man of the matches,

1341
00:58:24,000 --> 00:58:24,923
which is 15.

1342
00:58:25,800 --> 00:58:29,129
So these are the few operations
that were performed on rdds.

1343
00:58:29,129 --> 00:58:31,600
Now, let us move on
to our Pokémon use case

1344
00:58:31,600 --> 00:58:34,800
so that we can understand
RDDs in a much better way.

1345
00:58:35,800 --> 00:58:39,300
So the steps to be performed
in the Pokémon use case are: loading

1346
00:58:39,300 --> 00:58:41,164
the Pokemon data dot CSV file

1347
00:58:41,164 --> 00:58:44,624
from an external storage
into an rdd removing the schema

1348
00:58:44,624 --> 00:58:46,700
from the Pokémon
data dot CSV file

1349
00:58:46,700 --> 00:58:49,730
and finding out the total number
of water type Pokemon

1350
00:58:49,730 --> 00:58:52,117
finding the total number
of fire type Pokemon.

1351
00:58:52,117 --> 00:58:53,882
I know it's getting interesting.

1352
00:58:53,882 --> 00:58:57,000
So let me explain each
and every step practically.

1353
00:58:57,700 --> 00:59:00,200
So here I am creating
a new RDD

1354
00:59:00,200 --> 00:59:02,400
by name Pokemon data rdd one

1355
00:59:02,400 --> 00:59:05,700
and I'm loading my CSV file
from an external storage.

1356
00:59:05,700 --> 00:59:08,100
That is my hdfs as a text file.

1357
00:59:08,100 --> 00:59:11,800
So the Pokemon data dot CSV file
has been successfully loaded

1358
00:59:11,800 --> 00:59:12,800
into our new rdd.

1359
00:59:12,800 --> 00:59:14,100
So let us display the data

1360
00:59:14,100 --> 00:59:17,100
which is present
in our Pokémon data rdd one.

1361
00:59:17,200 --> 00:59:19,700
I am using collect
action command for this.

1362
00:59:20,000 --> 00:59:23,900
So here we have 721 rows
of data of all the types

1363
00:59:23,900 --> 00:59:28,979
of Pokémon we have. So now,
let us display the schema

1364
00:59:28,979 --> 00:59:30,441
of the data we have

1365
00:59:30,700 --> 00:59:33,900
I have used the action command
first in order to display

1366
00:59:33,900 --> 00:59:35,727
the first line of a CSV file

1367
00:59:35,727 --> 00:59:38,600
which happens to be
the schema of a CSV file.

1368
00:59:38,600 --> 00:59:40,000
So we have index

1369
00:59:40,000 --> 00:59:42,100
of the Pokémon, name
of the Pokémon,

1370
00:59:42,100 --> 00:59:46,700
its type, total points,
HP, attack points, defense points,

1371
00:59:46,992 --> 00:59:50,607
special attack, special
defense, speed, generation,

1372
00:59:50,700 --> 00:59:51,938
and we can also find

1373
00:59:51,938 --> 00:59:54,600
if a particular Pokemon
is legendary or not.

1374
00:59:55,773 --> 00:59:57,926
Here, I'm creating a new RDD,

1375
00:59:58,000 --> 00:59:59,400
which is no header

1376
00:59:59,400 --> 01:00:02,800
and I'm using filter operation
in order to remove the schema

1377
01:00:02,800 --> 01:00:04,900
of a Pokemon data dot CSV file.

1378
01:00:04,900 --> 01:00:08,407
The schema of Pokemon data
dot CSV file has been removed

1379
01:00:08,407 --> 01:00:10,705
because Spark
considers the schema

1380
01:00:10,705 --> 01:00:12,300
as data to be processed.

1381
01:00:12,300 --> 01:00:13,480
So for this reason,

1382
01:00:13,480 --> 01:00:16,500
we remove the schema. Now,
let's display the data

1383
01:00:16,500 --> 01:00:19,000
which is present
in the no header RDD.

1384
01:00:19,000 --> 01:00:20,441
I am using action command

1385
01:00:20,441 --> 01:00:22,500
collect in order
to display the data

1386
01:00:22,500 --> 01:00:24,700
which is present
in no header rdd.

1387
01:00:24,900 --> 01:00:26,104
So this is the data

1388
01:00:26,104 --> 01:00:28,195
which is stored
in the no header RDD

1389
01:00:28,195 --> 01:00:29,400
without the schema.
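Taking the first line as the schema and filtering it out can be sketched in plain Python. The header text and sample rows below are illustrative assumptions that follow the column order read out in the session; slicing and a list comprehension stand in for the `first` action and the `filter` transformation.

```python
# Illustrative rows, not the course file: index, name, type, total, HP, attack, defense.
lines = [
    "Index,Name,Type,Total,HP,Attack,Defense",
    "1,Bulbasaur,Grass,318,45,49,49",
    "4,Charmander,Fire,309,39,52,43",
    "7,Squirtle,Water,314,44,48,65",
]

header = lines[0]                                       # first: the schema line
no_header = [line for line in lines if line != header]  # filter: drop the schema line

print(header)
print(no_header)
```

Without this step the header would be treated as one more data row, for example when a later step converts the defense column to a number.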

1390
01:00:31,200 --> 01:00:33,978
So now let us find out
the number of partitions

1391
01:00:33,978 --> 01:00:37,300
into which our no header RDD
has been split.

1392
01:00:37,300 --> 01:00:40,320
So I am using the partitions
operation in order to find

1393
01:00:40,320 --> 01:00:42,060
out the number of partitions

1394
01:00:42,060 --> 01:00:45,000
the data was split
into. According to the result,

1395
01:00:45,000 --> 01:00:48,300
the no header RDD has been split
into two partitions.

1396
01:00:48,600 --> 01:00:52,000
I am here creating a new RDD
by name water rdd

1397
01:00:52,000 --> 01:00:55,100
and I'm using filter
transformation in order to find

1398
01:00:55,100 --> 01:00:59,000
out the water type Pokémon in
our Pokémon data dot CSV file.

1399
01:00:59,600 --> 01:01:02,800
I'm using action command collect
in order to print the data

1400
01:01:02,800 --> 01:01:04,900
which is present in water rdd.

1401
01:01:05,200 --> 01:01:08,000
So these are the total number
of water type Pokemon

1402
01:01:08,000 --> 01:01:10,528
that we have in our
Pokémon data dot CSV.

1403
01:01:10,528 --> 01:01:11,160
Similarly.

1404
01:01:11,160 --> 01:01:13,500
Let's find out
the fire type Pokemons.

1405
01:01:14,600 --> 01:01:17,500
I'm creating a new RDD
by the name fire RDD

1406
01:01:17,500 --> 01:01:20,523
and applying filter operation
in order to find out

1407
01:01:20,523 --> 01:01:23,300
the fire type Pokemon
present in our CSV file.

1408
01:01:24,200 --> 01:01:27,200
I'm using collect action command
in order to print the data

1409
01:01:27,200 --> 01:01:29,200
which is present in fire rdd.

1410
01:01:29,400 --> 01:01:32,100
So these are the fire type
Pokemon which are present

1411
01:01:32,100 --> 01:01:34,400
in our Pokémon
data dot CSV file.

1412
01:01:34,600 --> 01:01:37,600
Now, let us count the total
number of water type Pokemon

1413
01:01:37,600 --> 01:01:40,400
which are present
in a Pokemon data dot CSV file.

1414
01:01:40,400 --> 01:01:44,500
I am using count action for this
and we have 112 water type

1415
01:01:44,500 --> 01:01:47,400
Pokémon present in
our Pokémon data dot CSV file.

1416
01:01:47,400 --> 01:01:47,924
Similarly.

1417
01:01:47,924 --> 01:01:50,600
Let's find out the total number
of fire-type Pokémon

1418
01:01:50,600 --> 01:01:54,300
that we have. I'm using the count
action command for the same.

1419
01:01:54,300 --> 01:01:56,178
So we have a total of 52

1420
01:01:56,178 --> 01:01:59,800
fire type Pokémon in our
Pokemon data dot CSV file.
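Filtering by type and then counting can be sketched in plain Python. The rows are illustrative, and comparing the type column exactly is an assumption (the session may have matched on the line text instead); `len` stands in for the `count` action.

```python
# Illustrative rows, header already removed: index, name, type, total, HP, attack, defense.
no_header = [
    "1,Bulbasaur,Grass,318,45,49,49",
    "4,Charmander,Fire,309,39,52,43",
    "7,Squirtle,Water,314,44,48,65",
    "54,Psyduck,Water,320,50,52,48",
]


def pokemon_type(row):
    """Return the type column of one CSV row."""
    return row.split(",")[2]


water_rdd = [row for row in no_header if pokemon_type(row) == "Water"]  # filter
fire_rdd = [row for row in no_header if pokemon_type(row) == "Fire"]    # filter

water_count = len(water_rdd)  # count
fire_count = len(fire_rdd)    # count

print(water_count, fire_count)
```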

1421
01:01:59,800 --> 01:02:01,992
Let's continue with
our further operations

1422
01:02:01,992 --> 01:02:05,200
where we'll find out the highest
defense strength of a Pokémon.

1423
01:02:05,300 --> 01:02:08,400
I am creating a new RDD
by the name defense list

1424
01:02:08,400 --> 01:02:10,400
and I'm applying
map transformation

1425
01:02:10,400 --> 01:02:12,935
and splitting out
the column number six in order

1426
01:02:12,935 --> 01:02:14,500
to extract the defense points

1427
01:02:14,500 --> 01:02:18,100
of all the Pokemons present in
our Pokémon data dot CSV file.

1428
01:02:18,300 --> 01:02:21,400
So the data has been stored
successfully into a new RDD,

1429
01:02:21,400 --> 01:02:23,100
which is defense list.

1430
01:02:23,500 --> 01:02:23,700
Now.

1431
01:02:23,700 --> 01:02:26,249
I'm using the max action command
in order to print out

1432
01:02:26,249 --> 01:02:29,100
the maximum defense strength
out of all the Pokemons.

1433
01:02:29,200 --> 01:02:32,576
So we have 230 points as
the maximum defense strength

1434
01:02:32,576 --> 01:02:34,200
amongst all the Pokemons.

1435
01:02:34,200 --> 01:02:35,702
So in our further operations,

1436
01:02:35,702 --> 01:02:38,502
let's find out the Pokemons
which come under the category

1437
01:02:38,502 --> 01:02:40,600
of having the highest
defense strength,

1438
01:02:40,600 --> 01:02:42,400
which is 230 points.

1439
01:02:43,100 --> 01:02:45,456
In order to find out
the name of the Pokemon

1440
01:02:45,456 --> 01:02:47,100
with highest defense strength.

1441
01:02:47,100 --> 01:02:49,182
I'm creating a new RDD
with the name

1442
01:02:49,182 --> 01:02:51,717
defense with Pokemon name,
and I'm applying

1443
01:02:51,717 --> 01:02:54,000
the map transformation on
to the previous RDD,

1444
01:02:54,000 --> 01:02:55,000
which is no header,

1445
01:02:55,000 --> 01:02:56,062
and I'm splitting out

1446
01:02:56,062 --> 01:02:59,100
column number six which happens
to be the different strengths

1447
01:02:59,100 --> 01:03:02,300
in order to extract the data
from that particular row,

1448
01:03:02,300 --> 01:03:05,100
which has the defense
strength as 230 points.

1449
01:03:05,769 --> 01:03:08,230
Now I'm creating a new RDD again

1450
01:03:08,300 --> 01:03:11,500
with the name maximum defense
Pokemon and I'm applying

1451
01:03:11,500 --> 01:03:15,100
the groupByKey transformation
in order to display the Pokemon

1452
01:03:15,100 --> 01:03:18,675
which have the maximum defense
points that is 230 points.

1453
01:03:18,675 --> 01:03:20,400
So according to the result,

1454
01:03:20,400 --> 01:03:23,400
we have Steelix, Steelix
Mega, Shuckle, Aggron,

1455
01:03:23,400 --> 01:03:24,500
and Aggron Mega

1456
01:03:24,500 --> 01:03:27,200
as the Pokémon with
the highest defense strength,

1457
01:03:27,200 --> 01:03:28,800
which is 230 points.

1458
01:03:28,800 --> 01:03:31,100
Now we shall find
out the Pokemon

1459
01:03:31,100 --> 01:03:33,600
which has the least
defense strength.

1460
01:03:34,200 --> 01:03:35,900
So before we find
out the Pokemon

1461
01:03:35,900 --> 01:03:37,580
with the least defense strength,

1462
01:03:37,580 --> 01:03:39,694
let us find out
the least defense points

1463
01:03:39,694 --> 01:03:41,700
which are present
in the defense list.

1464
01:03:42,900 --> 01:03:45,100
So in order to find
out the Pokémon

1465
01:03:45,100 --> 01:03:46,788
with the least defense strength,

1466
01:03:46,788 --> 01:03:48,200
I have created a new RDD

1467
01:03:48,200 --> 01:03:51,654
by name minimum defense Pokemon
and I have applied distinct

1468
01:03:51,654 --> 01:03:54,900
and sortBy transformations
on to the defense list rdd

1469
01:03:54,900 --> 01:03:57,900
in order to extract
the least defense points present

1470
01:03:57,900 --> 01:03:58,955
in the defense list

1471
01:03:58,955 --> 01:04:01,484
and I have used take
action command in order

1472
01:04:01,484 --> 01:04:02,600
to display the data

1473
01:04:02,600 --> 01:04:05,300
which is present
in minimum defense Pokemon rdd.
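The distinct, sortBy, and take steps just described can be sketched in plain Python. The defense values are illustrative; a set, `sorted`, and a slice stand in for the three RDD operations.

```python
# Illustrative defense values, as they might sit in the defense list.
defense_list = [65.0, 5.0, 230.0, 5.0, 49.0]

distinct_defense = set(defense_list)   # distinct: drop repeated values
ascending = sorted(distinct_defense)   # sortBy: smallest value first
minimum_defense = ascending[:1]        # take(1): keep only the smallest

print(minimum_defense)
```

Applying distinct first means each value is sorted once, however many Pokémon share it.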

1474
01:04:05,300 --> 01:04:06,700
So according to the results,

1475
01:04:06,700 --> 01:04:09,300
we have five points as
the least defense strength

1476
01:04:09,300 --> 01:04:11,053
of a particular Pokémon. Now,

1477
01:04:11,053 --> 01:04:13,148
let us find out
the name of the Pokémon

1478
01:04:13,148 --> 01:04:16,650
which comes under the category
of having Five Points as

1479
01:04:16,650 --> 01:04:18,290
its defense strength. Now,

1480
01:04:18,290 --> 01:04:19,808
let us create a new rdd

1481
01:04:19,808 --> 01:04:23,956
which is defense with Pokemon name
two, and apply the map transformation

1482
01:04:23,956 --> 01:04:27,217
and split the column number 6
and store the data

1483
01:04:27,217 --> 01:04:28,259
into our new rdd

1484
01:04:28,259 --> 01:04:30,800
which is defense
with Pokemon name two.

1485
01:04:32,000 --> 01:04:34,500
The transformation has
been successfully applied

1486
01:04:34,500 --> 01:04:36,970
and the data is now
stored into the new rdd

1487
01:04:36,970 --> 01:04:37,900
which is defense

1488
01:04:37,900 --> 01:04:41,900
with Pokemon name two. The data
has been successfully loaded.

1489
01:04:41,900 --> 01:04:45,500
Now, let us apply
the further operations here.

1490
01:04:45,538 --> 01:04:50,000
I am creating another rdd with
name minimum defense Pokemon

1491
01:04:50,000 --> 01:04:53,400
and I'm applying the groupByKey
transformation in order

1492
01:04:53,400 --> 01:04:55,500
to extract the data from the row

1493
01:04:55,500 --> 01:04:58,206
which has the defense
points as 5.0.

1494
01:04:58,500 --> 01:05:01,829
The data has been successfully
loaded. Now, let us display

1495
01:05:01,829 --> 01:05:03,300
the data which is present

1496
01:05:03,300 --> 01:05:07,307
in the minimum defense Pokemon RDD.
Now, according to the results,

1497
01:05:07,307 --> 01:05:09,073
we have two Pokémon

1498
01:05:09,073 --> 01:05:12,098
which come under the category
of having five points

1499
01:05:12,098 --> 01:05:15,400
as their defense strength.
The Pokémon Chansey

1500
01:05:15,400 --> 01:05:17,500
and Happiny are
the two Pokémon

1501
01:05:17,500 --> 01:05:24,500
which have the least
defense strength. The world

1502
01:05:24,500 --> 01:05:26,100
of Information Technology

1503
01:05:26,100 --> 01:05:29,786
and big data processing started
to see multiple potentialities

1504
01:05:29,786 --> 01:05:31,600
from Spark coming into action.

1505
01:05:31,700 --> 01:05:34,685
One such pinnacle in Spark's
technology advancements is

1506
01:05:34,685 --> 01:05:35,600
the data frame.

1507
01:05:35,600 --> 01:05:38,200
And today we shall
understand the technicalities

1508
01:05:38,200 --> 01:05:39,000
of data frames

1509
01:05:39,000 --> 01:05:42,500
in Spark. A data frame in Spark
is all about performance.

1510
01:05:42,500 --> 01:05:46,300
It is a powerful multifunctional
and an integrated data structure

1511
01:05:46,300 --> 01:05:49,100
where the programmer can work
with different libraries

1512
01:05:49,100 --> 01:05:52,000
and perform numerous
functionalities without breaking

1513
01:05:52,000 --> 01:05:53,529
a sweat to understand the APIs

1514
01:05:53,529 --> 01:05:54,823
and libraries involved

1515
01:05:54,823 --> 01:05:57,500
in the process.
Without wasting any time,

1516
01:05:57,500 --> 01:06:00,000
let us understand the topics
for today's discussion.

1517
01:06:00,000 --> 01:06:01,900
The lineup of the docket
for understanding

1518
01:06:01,900 --> 01:06:03,800
data frames in Spark is below.

1519
01:06:03,800 --> 01:06:06,962
We will begin with
what data frames are. Here,

1520
01:06:06,962 --> 01:06:09,700
we will learn what
exactly a data frame is,

1521
01:06:09,700 --> 01:06:13,706
what it looks like, and what
its functionalities are. Then we

1522
01:06:13,706 --> 01:06:16,400
shall see why we need
data frames. Here,

1523
01:06:16,400 --> 01:06:18,900
we shall understand
the requirements which led us

1524
01:06:18,900 --> 01:06:21,200
to the invention
of data frames. Later,

1525
01:06:21,200 --> 01:06:23,400
I'll walk you through
the important features

1526
01:06:23,400 --> 01:06:24,282
of data frames.

1527
01:06:24,282 --> 01:06:25,400
Then we shall look

1528
01:06:25,400 --> 01:06:28,000
into the sources from which
data frames in Spark

1529
01:06:28,000 --> 01:06:31,000
get their data. Once
the theory part is finished,

1530
01:06:31,000 --> 01:06:33,400
I will get us involved
into the Practical part

1531
01:06:33,400 --> 01:06:35,700
where the creation
of a dataframe happens to be

1532
01:06:35,700 --> 01:06:39,400
the first step. Next, we shall work
with an interesting example,

1533
01:06:39,400 --> 01:06:41,100
which is related to football

1534
01:06:41,100 --> 01:06:43,237
and finally to understand
the data frames

1535
01:06:43,237 --> 01:06:44,200
in spark in a much

1536
01:06:44,200 --> 01:06:46,980
better way, we shall work
with the most trending topic

1537
01:06:46,980 --> 01:06:47,711
as our use case,

1538
01:06:47,711 --> 01:06:50,300
which is none other
than the Game of Thrones.

1539
01:06:50,400 --> 01:06:52,100
So let's get started.

1540
01:06:52,200 --> 01:06:55,500
What is a data frame?
In simple terms, a data frame

1541
01:06:55,500 --> 01:06:58,617
can be considered as a
distributed collection of data.

1542
01:06:58,617 --> 01:07:01,156
The data is organized
under named columns,

1543
01:07:01,156 --> 01:07:04,500
which provide us the operations
to filter, group, process,

1544
01:07:04,500 --> 01:07:08,205
and aggregate the available data.
Data frames can also be used

1545
01:07:08,205 --> 01:07:11,100
with Spark SQL, and we
can construct data frames

1546
01:07:11,100 --> 01:07:14,800
from structured data files, RDDs,
or from an external storage

1547
01:07:14,800 --> 01:07:17,500
like HDFS, Hive, Cassandra, HBase,

1548
01:07:17,500 --> 01:07:19,676
and many more. With
this, we shall look

1549
01:07:19,676 --> 01:07:21,500
into a more simplified example,

1550
01:07:21,500 --> 01:07:24,455
which will give us a basic
description of a data frame.

1551
01:07:24,455 --> 01:07:26,700
So we shall deal
with an employee database

1552
01:07:26,700 --> 01:07:29,229
where we have entities
and their data types.

1553
01:07:29,229 --> 01:07:31,817
So the name of the employee
is the first entity,

1554
01:07:31,817 --> 01:07:33,500
and its respective data type

1555
01:07:33,500 --> 01:07:37,102
is string data type. Similarly,
employee ID has data type

1556
01:07:37,102 --> 01:07:39,004
of string, employee phone number

1557
01:07:39,004 --> 01:07:40,646
is integer data type,

1558
01:07:40,646 --> 01:07:43,642
and employee address happens
to be string data type.

1559
01:07:43,642 --> 01:07:46,700
And finally the employee salary
is float data type.

1560
01:07:46,700 --> 01:07:49,500
All this data is stored
into an external storage,

1561
01:07:49,500 --> 01:07:51,093
which may be hdfs Hive

1562
01:07:51,093 --> 01:07:53,700
or Cassandra, using
the data frame API

1563
01:07:53,700 --> 01:07:55,200
with their respective schema,

1564
01:07:55,200 --> 01:07:56,500
which consists of the name

1565
01:07:56,500 --> 01:07:58,913
of the entity along
with its data type. Now

1566
01:07:58,913 --> 01:08:01,900
that we have understood what
exactly a data frame is.
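The employee example can be sketched in plain Python as named columns paired with data types. This is not the Spark data frame API; the column names, the sample row, and the `conforms` helper are hypothetical and only illustrate what a schema of entity names with data types means.

```python
# Named columns with their data types, following the employee example above.
schema = [
    ("name", str),
    ("employee_id", str),
    ("phone_number", int),
    ("address", str),
    ("salary", float),
]

# One made-up record laid out in schema order.
row = ("Asha", "E001", 5550100, "12 Example Street", 52000.0)


def conforms(record, schema):
    """Check that a record has one value of the declared type per column."""
    return len(record) == len(schema) and all(
        isinstance(value, column_type)
        for value, (_, column_type) in zip(record, schema)
    )


print([column for column, _ in schema])
print(conforms(row, schema))
```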

1567
01:08:01,900 --> 01:08:03,910
Let us quickly move on
to our next stage

1568
01:08:03,910 --> 01:08:06,900
where we shall understand the
requirement for a data frame.

1569
01:08:07,000 --> 01:08:07,806
It provides us

1570
01:08:07,806 --> 01:08:10,400
multiple programming
language supportability.

1571
01:08:10,400 --> 01:08:13,670
It has the capacity to work
with multiple data sources,

1572
01:08:13,670 --> 01:08:16,904
it can process both structured
and unstructured data.

1573
01:08:16,904 --> 01:08:19,455
And finally it is
well versed with slicing

1574
01:08:19,455 --> 01:08:20,681
and dicing the data.

1575
01:08:20,681 --> 01:08:21,723
So the first one is

1576
01:08:21,723 --> 01:08:24,900
the supportability for
multiple programming languages.

1577
01:08:24,900 --> 01:08:26,937
The IT industry
is required a powerful

1578
01:08:26,937 --> 01:08:28,700
and an integrated data structure

1579
01:08:28,700 --> 01:08:29,500
which could support

1580
01:08:29,500 --> 01:08:31,800
multiple programming languages
at the same

1581
01:08:31,800 --> 01:08:33,900
time, without
the requirement of

1582
01:08:33,900 --> 01:08:36,900
an additional API. The data frame
was the one-stop solution

1583
01:08:36,900 --> 01:08:39,900
which supported multiple
languages along with a single

1584
01:08:39,900 --> 01:08:41,982
API. The most popular languages

1585
01:08:41,982 --> 01:08:45,046
that a data frame could
support are R, Python,

1586
01:08:45,046 --> 01:08:48,777
Scala, Java, and many more.
The next requirement

1587
01:08:48,777 --> 01:08:51,500
was to support
the multiple data sources.

1588
01:08:51,500 --> 01:08:53,608
We all know that
a real-time approach

1589
01:08:53,608 --> 01:08:55,700
to data processing
will never end up

1590
01:08:55,700 --> 01:08:57,700
at a single data
source. The data frame is

1591
01:08:57,700 --> 01:08:59,057
one such data structure,

1592
01:08:59,057 --> 01:09:02,000
which has the capability
to support and process data

1593
01:09:02,000 --> 01:09:05,615
from a variety of data
sources. Hadoop, Cassandra,

1594
01:09:05,615 --> 01:09:07,207
JSON files, HBase, and

1595
01:09:07,207 --> 01:09:10,284
CSV files are examples,
to name a few.

1596
01:09:10,300 --> 01:09:12,947
The next requirement was
to process structured

1597
01:09:12,947 --> 01:09:14,200
and unstructured data.

1598
01:09:14,200 --> 01:09:17,400
The Big Data environment was
designed to store huge amounts

1599
01:09:17,400 --> 01:09:18,487
of data regardless

1600
01:09:18,487 --> 01:09:19,755
of which type exactly

1601
01:09:19,755 --> 01:09:22,827
it is. Now, Spark's data frame
is designed in such a way

1602
01:09:22,827 --> 01:09:25,994
that it can store a huge
collection of both structured

1603
01:09:25,994 --> 01:09:27,249
and unstructured data

1604
01:09:27,249 --> 01:09:29,900
in a tabular format
along with its schema.

1605
01:09:29,900 --> 01:09:33,300
The next requirement was slicing
and dicing data. Now,

1606
01:09:33,300 --> 01:09:34,300
the humongous amount

1607
01:09:34,300 --> 01:09:37,400
of data stored in Spark's
data frame can be sliced

1608
01:09:37,400 --> 01:09:40,975
and diced using operations
like filter, select, group

1609
01:09:40,975 --> 01:09:42,300
by, order by, and many

1610
01:09:42,300 --> 01:09:45,100
more. These operations
are applied to the data

1611
01:09:45,100 --> 01:09:47,456
which is stored in the form
of rows and columns

1612
01:09:47,456 --> 01:09:50,388
in a data frame. These
were a few crucial requirements

1613
01:09:50,388 --> 01:09:52,700
which led to the invention
of data frames.
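To make the slicing and dicing operations concrete, here is a minimal sketch in plain Python rather than in Spark itself. The rows, the column names, and the helper functions filter_rows, select, order_by, and group_by_count are all made up for illustration; they only mimic, on a small in-memory list, what the data frame operations of the same names do conceptually.

```python
# Plain-Python analogy of data frame slicing and dicing (NOT the Spark API).
# The rows and column names below are hypothetical sample data.
rows = [
    {"name": "Asha", "dept": "sales", "salary": 52000},
    {"name": "Ben", "dept": "engineering", "salary": 68000},
    {"name": "Chen", "dept": "sales", "salary": 47000},
    {"name": "Dara", "dept": "engineering", "salary": 71000},
]


def filter_rows(data, predicate):
    """filter: keep only the rows for which predicate(row) is true."""
    return [row for row in data if predicate(row)]


def select(data, *columns):
    """select: keep only the named columns of every row."""
    return [{col: row[col] for col in columns} for row in data]


def order_by(data, column, descending=False):
    """order by: return the rows sorted on one column."""
    return sorted(data, key=lambda row: row[column], reverse=descending)


def group_by_count(data, column):
    """group by + count: number of rows for each distinct column value."""
    counts = {}
    for row in data:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Slice: employees earning more than 50,000, highest salary first,
# keeping only the name and salary columns.
well_paid = filter_rows(rows, lambda row: row["salary"] > 50000)
names_by_salary = select(
    order_by(well_paid, "salary", descending=True), "name", "salary"
)

# Dice: how many employees there are in each department.
per_dept = group_by_count(rows, "dept")

print(names_by_salary)
print(per_dept)
```

Each helper returns a new list and leaves the original rows untouched, which is also how transformations on a data frame behave.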

1614
01:09:52,800 --> 01:09:55,173
Now, let us get
into the important features

1615
01:09:55,173 --> 01:09:55,997
of data frames

1616
01:09:55,997 --> 01:09:58,700
which give them an edge
over the other alternatives.

1617
01:09:59,100 --> 01:10:02,400
Immutability, lazy
evaluation, fault tolerance,

1618
01:10:02,400 --> 01:10:04,400
and distributed memory storage.

1619
01:10:04,400 --> 01:10:07,800
Let us discuss each
and every feature in detail.

1620
01:10:07,800 --> 01:10:10,600
So the first one is
immutability. Similar to

1621
01:10:10,600 --> 01:10:13,295
the resilient distributed
datasets, the data frames

1622
01:10:13,295 --> 01:10:16,688
in Spark are also immutable.
The term immutable means

1623
01:10:16,688 --> 01:10:18,100
that the data stored

1624
01:10:18,100 --> 01:10:20,300
in a data frame
will not be altered.

1625
01:10:20,300 --> 01:10:23,100
The only way to alter the data
present in a data frame

1626
01:10:23,100 --> 01:10:25,700
would be by applying
transformation operations,

1627
01:10:25,700 --> 01:10:26,600
which return a new data frame.

1628
01:10:26,600 --> 01:10:28,900
So the next feature
is lazy evaluation.

1629
01:10:28,900 --> 01:10:32,126
Lazy evaluation
is the key to the remarkable

1630
01:10:32,126 --> 01:10:36,100
performance offered by Spark.
Similar to RDDs, data frames

1631
01:10:36,100 --> 01:10:38,999
in Spark will not produce
any output on the screen

1632
01:10:38,999 --> 01:10:41,900
until and unless an action
command is encountered.
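The two features just described, immutability and lazy evaluation, can be sketched with a toy class in plain Python. This is only an illustration under stated assumptions: LazyFrame, its methods, and the sample rows are hypothetical and are not how Spark is implemented. Transformations (filter, select) only record a step and return a new object, while the action (collect) is the point where the recorded steps actually run.

```python
# Toy model of immutability and lazy evaluation (plain Python, NOT Spark).
# LazyFrame and the sample rows are hypothetical, for illustration only.
class LazyFrame:
    def __init__(self, rows, steps=()):
        self._rows = tuple(rows)    # source rows are never modified
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def filter(self, predicate):
        # Transformation: record the step and return a NEW LazyFrame.
        def step(rows):
            return [row for row in rows if predicate(row)]
        return LazyFrame(self._rows, self._steps + (step,))

    def select(self, *columns):
        # Transformation: record the step and return a NEW LazyFrame.
        def step(rows):
            return [{col: row[col] for col in columns} for row in rows]
        return LazyFrame(self._rows, self._steps + (step,))

    def collect(self):
        # Action: only now are the recorded steps actually executed.
        rows = list(self._rows)
        for step in self._steps:
            rows = step(rows)
        return rows


calls = []


def is_adult(row):
    calls.append(row["name"])  # lets us observe when the predicate runs
    return row["age"] >= 18


people = LazyFrame([{"name": "Ana", "age": 31}, {"name": "Raj", "age": 12}])
adults = people.filter(is_adult).select("name")

ran_before_action = list(calls)  # snapshot taken before any action
result = adults.collect()        # the action triggers the work

print(ran_before_action)
print(result)
```

The snapshot taken before collect is empty because nothing has been computed yet, and people still holds both original rows after adults has been derived from it.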

1633
01:10:41,900 --> 01:10:44,300
The next feature
is fault tolerance.

1634
01:10:44,300 --> 01:10:45,182
There is no way

1635
01:10:45,182 --> 01:10:47,900
that Spark's data frames
can lose their data.

1636
01:10:47,900 --> 01:10:50,300
They follow the principle
of being fault tolerant

1637
01:10:50,300 --> 01:10:51,782
to the unexpected calamities

1638
01:10:51,782 --> 01:10:53,900
which tend to destroy
the available data.

1639
01:10:53,900 --> 01:10:55,893
The next feature is distributed

1640
01:10:55,893 --> 01:10:58,590
storage. Spark's data frames
distribute the data

1641
01:10:58,590 --> 01:11:00,000
across multiple locations

1642
01:11:00,000 --> 01:11:03,294
so that in case of a node
failure the next available node

1643
01:11:03,294 --> 01:11:05,900
can take its place to continue
the data processing.

1644
01:11:05,900 --> 01:11:08,700
The next stage will be
about the multiple data sources

1645
01:11:08,700 --> 01:11:12,204
that the Spark data frame
can support. The Spark API

1646
01:11:12,204 --> 01:11:13,690
can integrate itself

1647
01:11:13,690 --> 01:11:17,700
with multiple programming
languages such as Scala, Java,

1648
01:11:17,700 --> 01:11:19,300
Python, R, MySQL,

1649
01:11:19,300 --> 01:11:22,600
and many more, making
itself capable of handling

1650
01:11:22,600 --> 01:11:26,700
a variety of data sources
such as Hadoop, Hive, HBase,

1651
01:11:26,800 --> 01:11:28,500
Cassandra, JSON files,

1652
01:11:28,600 --> 01:11:31,600
CSV files, MySQL,
and many more.

1653
01:11:32,200 --> 01:11:33,726
So this was the theory part

1654
01:11:33,726 --> 01:11:36,100
and now let us move
into the practical part,

1655
01:11:36,100 --> 01:11:37,000
where the creation

1656
01:11:37,000 --> 01:11:39,500
of a data frame happens
to be the first step.

1657
01:11:40,100 --> 01:11:42,412
So before we begin
the practical part,

1658
01:11:42,412 --> 01:11:43,975
let us load the libraries

1659
01:11:43,975 --> 01:11:47,600
which are required in order to
process the data in data frames.

1660
01:11:48,200 --> 01:11:50,822
So these are the few libraries
which we require

1661
01:11:50,822 --> 01:11:53,600
before we process the data
using our data frames.

1662
01:11:54,200 --> 01:11:56,300
Now that we have loaded
all the libraries

1663
01:11:56,300 --> 01:11:59,393
which we require to process
the data using data frames,

1664
01:11:59,393 --> 01:12:01,914
let us begin with the creation
of our data frame.

1665
01:12:01,914 --> 01:12:05,000
So we shall create a new data
frame with the name employee

1666
01:12:05,000 --> 01:12:05,935
and load the data

1667
01:12:05,935 --> 01:12:08,300
of the employees present
in an organization.

1668
01:12:08,300 --> 01:12:11,400
The details of the employees
will consist of the first name,

1669
01:12:11,400 --> 01:12:14,968
the last name, and their mail ID,
along with their salary.

1670
01:12:14,968 --> 01:12:18,500
So the first data frame has
been successfully created. Now,

1671
01:12:18,500 --> 01:12:20,700
let us design the schema
for this data frame.

1672
01:12:21,600 --> 01:12:24,100
So the schema for this data
frame has been described

1673
01:12:24,100 --> 01:12:27,900
as shown. The first name is of
string data type, and similarly,

1674
01:12:27,900 --> 01:12:29,900
the last name is
of string data type,

1675
01:12:29,900 --> 01:12:31,500
along with the mail address.

1676
01:12:31,500 --> 01:12:34,500
And finally the salary
is of integer data type,

1677
01:12:34,500 --> 01:12:37,000
or you can give
float data type also.

1678
01:12:37,000 --> 01:12:39,882
So the schema has been
successfully defined. Now,

1679
01:12:39,882 --> 01:12:41,600
let us create
the data frame using

1680
01:12:41,600 --> 01:12:43,700
the createDataFrame function. Here,

1681
01:12:43,700 --> 01:12:47,260
I'm creating a new data frame
by starting a Spark context

1682
01:12:47,260 --> 01:12:50,200
and using the createDataFrame
method and loading

1683
01:12:50,200 --> 01:12:52,800
the data from employee
and employee schema.
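The idea of pairing employee rows with a schema can be sketched in plain Python. This is an analogy only, not the Spark call used in the session: the function create_data_frame, the column names, and the sample employees below are all hypothetical. The sketch shows what a schema buys you, namely that every row must have the declared columns with the declared types.

```python
# Plain-Python analogy of building a table from rows plus a schema
# (NOT the Spark API). All names and sample values are hypothetical.
employee_schema = [
    ("firstname", str),
    ("lastname", str),
    ("email", str),
    ("salary", int),
]

employees = [
    ("Jane", "Doe", "jane.doe@example.com", 55000),
    ("Ravi", "Kumar", "ravi.kumar@example.com", 61000),
]


def create_data_frame(rows, schema):
    """Attach column names to each row, checking length and types."""
    frame = []
    for row in rows:
        if len(row) != len(schema):
            raise ValueError(f"expected {len(schema)} values, got {len(row)}")
        record = {}
        for value, (column, expected_type) in zip(row, schema):
            if not isinstance(value, expected_type):
                raise TypeError(f"{column} must be {expected_type.__name__}")
            record[column] = value
        frame.append(record)
    return frame


emp_df = create_data_frame(employees, employee_schema)

# Print the rows, roughly what a show-style command would display.
for record in emp_df:
    print(record)
```

A row whose salary is given as text instead of a number is rejected with a TypeError, which is the kind of mismatch a declared schema is meant to catch early.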

1684
01:12:52,800 --> 01:12:55,200
The data frame is
successfully created. Now,

1685
01:12:55,200 --> 01:12:56,200
let's print the data

1686
01:12:56,200 --> 01:12:59,353
which exists
in the data frame EMP DF.

1687
01:13:00,273 --> 01:13:02,426
I am using the show method here.

1688
01:13:03,200 --> 01:13:03,907
So the data

1689
01:13:03,907 --> 01:13:07,700
which is present in EMP DF has
been successfully printed. Now,

1690
01:13:07,700 --> 01:13:09,600
let us move on to the next step.

1691
01:13:09,800 --> 01:13:12,800
So the next step in today's
discussion is working

1692
01:13:12,800 --> 01:13:15,500
with an example related
to the FIFA data set.

1693
01:13:16,100 --> 01:13:18,217
So the first step
in our FIFA example

1694
01:13:18,217 --> 01:13:20,772
would be loading the schema
for the CSV file

1695
01:13:20,772 --> 01:13:22,000
we are working with. So

1696
01:13:22,000 --> 01:13:24,400
the schema has been
successfully loaded.

1697
01:13:24,400 --> 01:13:28,066
Now let us load the CSV file
from our external storage

1698
01:13:28,066 --> 01:13:30,600
which is HDFS,
into our data frame,

1699
01:13:30,600 --> 01:13:31,907
which is FIFA DF.

1700
01:13:32,100 --> 01:13:34,394
The CSV file has been
successfully loaded

1701
01:13:34,394 --> 01:13:35,800
into our new data frame,

1702
01:13:35,800 --> 01:13:37,100
which is FIFA DF. Now,

1703
01:13:37,100 --> 01:13:39,300
let us print the schema
of the data frame using

1704
01:13:39,300 --> 01:13:40,900
the printSchema command.

1705
01:13:41,900 --> 01:13:43,400
So the schema
has been successfully

1706
01:13:43,400 --> 01:13:46,000
displayed here, and we have
the following details

1707
01:13:46,000 --> 01:13:49,300
of each and every player
in our CSV file. Now,

1708
01:13:49,300 --> 01:13:51,900
let's move on to further
operations on the data frame.

1709
01:13:53,100 --> 01:13:56,200
We will count the total number
of records of the players

1710
01:13:56,200 --> 01:13:59,100
we have in our CSV file
using the count command.

1711
01:13:59,300 --> 01:14:01,500
So we have a total
of eighteen thousand

1712
01:14:01,500 --> 01:14:04,300
two hundred and seven players
in our CSV file.

1713
01:14:04,300 --> 01:14:06,091
Now, let us find out the details

1714
01:14:06,091 --> 01:14:08,500
of the columns
we are working with.

1715
01:14:08,500 --> 01:14:11,300
So these are the columns
we are working with, which

1716
01:14:11,300 --> 01:14:15,466
consist of the ID of the player,
name, age, nationality, potential,

1717
01:14:15,466 --> 01:14:16,400
and many more.

1718
01:14:17,100 --> 01:14:19,600
Now let us use the column Value,

1719
01:14:19,600 --> 01:14:21,282
which has the value of each

1720
01:14:21,282 --> 01:14:23,900
and every player
for a particular team, and let

1721
01:14:23,900 --> 01:14:27,399
us use the describe command in order
to see the highest value

1722
01:14:27,399 --> 01:14:29,900
and the least value
provided to a player.

1723
01:14:29,900 --> 01:14:33,000
So we have a count
of a total of 18,207

1724
01:14:33,000 --> 01:14:34,400
players,

1725
01:14:34,400 --> 01:14:37,612
and the minimum worth
given to a player is 0

1726
01:14:37,612 --> 01:14:40,900
and the maximum is given
as 9 million pounds.

1727
01:14:41,100 --> 01:14:43,100
Now, let us use
the select command

1728
01:14:43,100 --> 01:14:46,216
in order to extract
the columns Name and Nationality,

1729
01:14:46,216 --> 01:14:48,172
to find out the name of each

1730
01:14:48,172 --> 01:14:50,800
and every player along
with his nationality.

1731
01:14:51,000 --> 01:14:54,226
So here we can display
the top 20 rows of the

1732
01:14:54,226 --> 01:14:55,200
players

1733
01:14:55,200 --> 01:14:58,900
which we have in our CSV file
along with their nationality.

1734
01:14:59,000 --> 01:14:59,700
Similarly,

1735
01:14:59,700 --> 01:15:03,200
let us find out the players
playing for a particular club.

1736
01:15:03,200 --> 01:15:05,500
So here we have
the top 20 players playing

1737
01:15:05,500 --> 01:15:07,029
for their respective clubs

1738
01:15:07,029 --> 01:15:08,300
along with their names.

1739
01:15:08,300 --> 01:15:10,800
For example, Messi
playing for Barcelona

1740
01:15:10,800 --> 01:15:13,100
and Ronaldo for
Juventus, etc.

1741
01:15:13,100 --> 01:15:15,100
Now, let's move
to the next stage.

1742
01:15:15,999 --> 01:15:17,900
Now, let us find out the players

1743
01:15:18,000 --> 01:15:21,000
who are found to be most active
in a particular national team

1744
01:15:21,000 --> 01:15:24,500
or a particular club
with age less than 30 years.

1745
01:15:24,500 --> 01:15:25,300
We shall use

1746
01:15:25,300 --> 01:15:28,300
the filter transformation
to apply this operation.

1747
01:15:28,600 --> 01:15:30,500
So here we have the details

1748
01:15:30,500 --> 01:15:33,300
of the players whose age
is less than 30 years

1749
01:15:33,300 --> 01:15:37,200
and their club and nationality
along with their jersey numbers.

1750
01:15:37,700 --> 01:15:40,700
So with this we have finished
our FIFA example. Now,

1751
01:15:40,700 --> 01:15:43,466
to understand the data frames
in a much better way,

1752
01:15:43,466 --> 01:15:45,300
let us move on
to our use case,

1753
01:15:45,300 --> 01:15:48,400
which is about the hottest
topic, Game of Thrones.

1754
01:15:49,100 --> 01:15:51,319
Similar to our previous example,

1755
01:15:51,319 --> 01:15:54,300
let us design the schema
of a CSV file first.

1756
01:15:54,300 --> 01:15:56,600
So this is the schema
for a CSV file

1757
01:15:56,600 --> 01:15:59,300
which contains the data
about Game of Thrones.

1758
01:15:59,800 --> 01:16:02,800
So, this is the schema
for our first CSV file.

1759
01:16:02,800 --> 01:16:06,200
Now, let us create the schema
for our next CSV file.

1760
01:16:06,700 --> 01:16:09,991
I have named the schema
for our next CSV file as schema 2,

1761
01:16:09,991 --> 01:16:12,667
and I've defined
the data types for each

1762
01:16:12,667 --> 01:16:16,300
and every entity. The schema
has been successfully designed

1763
01:16:16,300 --> 01:16:18,300
for the second CSV file also.

1764
01:16:18,300 --> 01:16:21,700
Now let us load our CSV files
from our external storage,

1765
01:16:21,700 --> 01:16:23,200
which is our HDFS.

1766
01:16:24,000 --> 01:16:28,100
The location of the first CSV
file, character deaths dot CSV,

1767
01:16:28,100 --> 01:16:29,076
is our HDFS,

1768
01:16:29,076 --> 01:16:31,000
which is defined as above,

1769
01:16:31,000 --> 01:16:33,303
and the schema has been
provided as schema,

1770
01:16:33,303 --> 01:16:35,919
and the header true option
has also been provided.

1771
01:16:35,919 --> 01:16:38,100
We are using the Spark
read function for this,

1772
01:16:38,100 --> 01:16:40,789
and we are loading this data
into our new data frame,

1773
01:16:40,789 --> 01:16:42,600
which is Game
of Thrones data frame.

1774
01:16:42,800 --> 01:16:43,700
Similarly,

1775
01:16:43,700 --> 01:16:45,743
let's load the other CSV file,

1776
01:16:45,743 --> 01:16:49,232
which is battles dot CSV,
into another data frame,

1777
01:16:49,232 --> 01:16:53,000
which is the Game of Thrones
battles data frame. The CSV file

1778
01:16:53,000 --> 01:16:54,792
has been successfully
loaded. Now,

1779
01:16:54,792 --> 01:16:57,200
let us continue
with the further operations.

1780
01:16:57,900 --> 01:17:00,207
Now let us print
the schema of our Game

1781
01:17:00,207 --> 01:17:03,200
of Thrones data frame using
the printSchema command.

1782
01:17:03,300 --> 01:17:04,962
So here we have the schema

1783
01:17:04,962 --> 01:17:07,200
which consists of
the name, allegiances,

1784
01:17:07,200 --> 01:17:10,821
death year, book of death,
and many more. Similarly,

1785
01:17:10,821 --> 01:17:15,100
let's print the schema of the Game
of Thrones battles data frame.

1786
01:17:16,300 --> 01:17:18,600
So this is the schema
for our new data frame,

1787
01:17:18,600 --> 01:17:20,700
which is the Game of Thrones
battles data frame.

1788
01:17:20,900 --> 01:17:23,600
Now, let's continue
with the further operations.

1789
01:17:24,100 --> 01:17:26,000
Now, let us display
the data frame

1790
01:17:26,000 --> 01:17:29,500
which we have created using
the following command. The data frame

1791
01:17:29,500 --> 01:17:32,188
has been successfully printed,
and this is the data

1792
01:17:32,188 --> 01:17:33,813
which we have in our data frame.

1793
01:17:33,813 --> 01:17:36,200
Now, let's continue
with the further operations.

1794
01:17:36,400 --> 01:17:38,449
We know that there are
a number

1795
01:17:38,449 --> 01:17:41,100
of houses present in the story
of Game of Thrones.

1796
01:17:41,100 --> 01:17:42,211
Now, let us find out

1797
01:17:42,211 --> 01:17:45,100
each and every individual house
present in the story.

1798
01:17:45,300 --> 01:17:48,200
Let us use the following command
in order to display each

1799
01:17:48,200 --> 01:17:51,400
and every house present
in the Game of Thrones story.

1800
01:17:51,600 --> 01:17:54,600
So we have the following houses
in the Game of Thrones story.

1801
01:17:54,600 --> 01:17:57,064
Now, let's continue
with the further operations.

1802
01:17:57,064 --> 01:18:00,299
The battles in Game
of Thrones were fought for ages.

1803
01:18:00,299 --> 01:18:02,000
Let us classify the wars waged

1804
01:18:02,000 --> 01:18:04,300
by their occurrence
according to the years.

1805
01:18:04,300 --> 01:18:06,800
We shall use select
and filter transformations,

1806
01:18:06,800 --> 01:18:09,750
and we shall access the columns
of the details of the battle

1807
01:18:09,750 --> 01:18:11,600
and the year in which
they were fought.

1808
01:18:12,100 --> 01:18:13,800
Let us first find
out the battles

1809
01:18:13,800 --> 01:18:15,300
which were fought in the year

1810
01:18:15,300 --> 01:18:18,000
298. The following
code consists of

1811
01:18:18,000 --> 01:18:19,300
a filter transformation

1812
01:18:19,300 --> 01:18:22,000
which will provide the details
for which we are looking.

1813
01:18:22,000 --> 01:18:23,350
So according to the result,

1814
01:18:23,350 --> 01:18:25,400
these were the battles that
were fought in the year

1815
01:18:25,400 --> 01:18:28,700
298, and we have the details
of the attacker kings

1816
01:18:28,700 --> 01:18:30,002
and the defender kings

1817
01:18:30,002 --> 01:18:33,648
and the outcome for the attacker,
along with their commanders

1818
01:18:33,648 --> 01:18:36,400
and the location
where the war was fought. Now,

1819
01:18:36,400 --> 01:18:39,861
let us find out the wars
waged in the year 299.

1820
01:18:40,400 --> 01:18:41,764
So these were the details

1821
01:18:41,764 --> 01:18:45,293
of the wars which were fought
in the year 299. Similarly,

1822
01:18:45,293 --> 01:18:48,600
let us also find out the wars
which were waged in the year 300.

1823
01:18:48,600 --> 01:18:49,952
So these were the wars

1824
01:18:49,952 --> 01:18:51,700
which were fought
in the year 300.

1825
01:18:51,700 --> 01:18:53,700
Now, let's move on
to the next operations

1826
01:18:53,700 --> 01:18:54,700
in our use case.

1827
01:18:55,000 --> 01:18:58,005
Now, let us find out the tactics
used in the wars waged

1828
01:18:58,005 --> 01:19:01,343
and also find out the total
number of wars waged using

1829
01:19:01,343 --> 01:19:05,200
each type of those tactics.
The following code will help us.

1830
01:19:05,800 --> 01:19:07,200
Here we are using select

1831
01:19:07,200 --> 01:19:10,196
and group by operations
in order to find out each

1832
01:19:10,196 --> 01:19:12,500
and every type of tactics
used in the war.

1833
01:19:12,600 --> 01:19:16,221
So they have used ambush, siege,
razing, and pitched battle types

1834
01:19:16,221 --> 01:19:17,500
of tactics in wars,

1835
01:19:17,500 --> 01:19:20,300
and most of the time they
have used the pitched battle type

1836
01:19:20,300 --> 01:19:21,600
of tactics in wars.

1837
01:19:21,600 --> 01:19:24,600
Now, let us continue
with the further operations.

1838
01:19:24,600 --> 01:19:27,300
The ambush type of battles are
the deadliest. Now,

1839
01:19:27,300 --> 01:19:28,650
let us find out the kings

1840
01:19:28,650 --> 01:19:31,397
who fought the battles
using this kind of tactic,

1841
01:19:31,397 --> 01:19:34,200
and also let us find out
the outcome of the battles

1842
01:19:34,200 --> 01:19:37,425
fought. Here, the following code
will help us extract the data

1843
01:19:37,425 --> 01:19:38,600
which we need here.

1844
01:19:38,600 --> 01:19:40,962
We are using the select
and where commands,

1845
01:19:40,962 --> 01:19:43,900
and we are selecting
the columns year, attacker king,

1846
01:19:43,900 --> 01:19:48,181
defender king, attacker outcome,
battle type, attacker commander,

1847
01:19:48,181 --> 01:19:49,840
and defender commander. Now,

1848
01:19:49,840 --> 01:19:51,500
let us print the details.

1849
01:19:51,900 --> 01:19:54,700
So these were the battles
fought using the ambush tactics,

1850
01:19:54,700 --> 01:19:56,300
and these were
the attacker kings

1851
01:19:56,300 --> 01:19:59,300
and the defender kings, along
with their respective commanders

1852
01:19:59,300 --> 01:20:01,641
and the wars waged
in a particular year. Now,

1853
01:20:01,641 --> 01:20:03,700
let's move on
to the next operation.

1854
01:20:04,300 --> 01:20:06,000
Now let us focus on the houses

1855
01:20:06,000 --> 01:20:08,600
and extract the deadliest house
amongst the rest.

1856
01:20:08,600 --> 01:20:11,893
The following code will help us
to find out the deadliest house

1857
01:20:11,893 --> 01:20:13,700
and the number
of battles they waged.

1858
01:20:13,700 --> 01:20:16,600
So here we have the details
of each and every house

1859
01:20:16,600 --> 01:20:19,383
and the battles they waged.
According to the results,

1860
01:20:19,383 --> 01:20:20,033
we have the Stark

1861
01:20:20,033 --> 01:20:22,883
and Lannister houses to be
the deadliest among the others.

1862
01:20:22,883 --> 01:20:25,400
Now, let's continue
with the rest of the operations.

1863
01:20:25,900 --> 01:20:28,100
Now, let us find out
the deadliest king

1864
01:20:28,100 --> 01:20:29,100
among the others.

1865
01:20:29,100 --> 01:20:31,400
We will use the following
command in order to find

1866
01:20:31,400 --> 01:20:33,600
the deadliest king
amongst the other kings,

1867
01:20:33,600 --> 01:20:35,600
who fought the most
number of wars.

1868
01:20:35,600 --> 01:20:38,000
So according to the results
we have Joffrey as

1869
01:20:38,000 --> 01:20:38,900
the deadliest king,

1870
01:20:38,900 --> 01:20:41,200
who fought a total
of 14 battles.

1871
01:20:41,200 --> 01:20:44,000
Now, let us continue
with the further operations.

1872
01:20:44,500 --> 01:20:46,323
Now, let us find out the houses

1873
01:20:46,323 --> 01:20:49,400
which defended the most number
of wars waged against them.

1874
01:20:49,400 --> 01:20:52,500
So the following code will help
us find out the details.

1875
01:20:52,600 --> 01:20:54,223
So according to the results,

1876
01:20:54,223 --> 01:20:57,400
we have the Lannister house
defending the most number

1877
01:20:57,400 --> 01:20:59,009
of wars waged against them.

1878
01:20:59,009 --> 01:21:01,682
Now, let us find out
the defender king who defended

1879
01:21:01,682 --> 01:21:04,900
the most number of battles
which were waged against him.

1880
01:21:05,400 --> 01:21:08,405
So according to the result, Robb
Stark is the king

1881
01:21:08,405 --> 01:21:10,597
who defended the most
number of battles

1882
01:21:10,597 --> 01:21:12,100
which were waged against him.

1883
01:21:12,100 --> 01:21:12,300
Now,

1884
01:21:12,300 --> 01:21:14,600
let's continue with
the further operations.

1885
01:21:14,800 --> 01:21:17,300
Since the Lannister house
is my personal favorite,

1886
01:21:17,300 --> 01:21:18,800
let me find out the details

1887
01:21:18,800 --> 01:21:20,800
of the characters
in Lannister house.

1888
01:21:20,800 --> 01:21:22,921
This code will
describe their name

1889
01:21:22,921 --> 01:21:24,400
and gender, 1 for male

1890
01:21:24,400 --> 01:21:27,700
and 0 for female, along with
their respective population.

1891
01:21:27,700 --> 01:21:29,830
So let me find out
the male characters

1892
01:21:29,830 --> 01:21:31,500
in the Lannister house first.

1893
01:21:32,300 --> 01:21:34,899
So here we have used the select
and where

1894
01:21:34,900 --> 01:21:37,600
commands in order to find out
the details of the characters

1895
01:21:37,600 --> 01:21:39,100
present in the Lannister house,

1896
01:21:39,100 --> 01:21:42,300
and the data has been stored
in the df1 data frame.

1897
01:21:42,300 --> 01:21:44,700
Let us print the data
which is present in the df1

1898
01:21:44,700 --> 01:21:46,900
data frame
using the show command.

1899
01:21:47,800 --> 01:21:49,000
So these are the details

1900
01:21:49,000 --> 01:21:51,400
of the characters
present in Lannister house,

1901
01:21:51,400 --> 01:21:53,100
who are male. Now, similarly,

1902
01:21:53,100 --> 01:21:55,400
let us find out the female
characters present

1903
01:21:55,400 --> 01:21:56,800
in Lannister house.

1904
01:21:57,500 --> 01:22:00,000
So these are the characters
present in Lannister house

1905
01:22:00,000 --> 01:22:01,100
who are female.

1906
01:22:01,300 --> 01:22:05,028
So we have a total of
69 male characters and 12

1907
01:22:05,028 --> 01:22:07,900
female characters
in the Lannister house.

1908
01:22:07,900 --> 01:22:11,311
Now, let us continue with
the next operations. At the end

1909
01:22:11,311 --> 01:22:12,800
of the day every episode

1910
01:22:12,800 --> 01:22:14,800
of Game of Thrones had
a noble character.

1911
01:22:15,000 --> 01:22:17,365
Let us now find out all
the noble characters

1912
01:22:17,365 --> 01:22:18,664
amongst all the houses

1913
01:22:18,664 --> 01:22:21,193
that we have in our Game
of Thrones CSV file.

1914
01:22:21,193 --> 01:22:24,100
The following code will help
us find out the details.

1915
01:22:25,600 --> 01:22:26,300
So the details

1916
01:22:26,300 --> 01:22:28,500
of all the characters
from all the houses

1917
01:22:28,500 --> 01:22:30,050
who are considered to be noble

1918
01:22:30,050 --> 01:22:32,200
have been saved
into the new data frame,

1919
01:22:32,200 --> 01:22:33,427
which is df3. Now,

1920
01:22:33,427 --> 01:22:36,800
let us print the details
from the df3 data frame.

1921
01:22:37,500 --> 01:22:40,000
So these are the top 20 members
from all the houses

1922
01:22:40,000 --> 01:22:42,900
who are considered to be noble,
along with their genders.

1923
01:22:42,900 --> 01:22:45,400
Now, let us count the total
number of noble characters

1924
01:22:45,400 --> 01:22:47,600
from the entire Game
of Thrones story.

1925
01:22:48,300 --> 01:22:50,500
So there are a total
of four hundred and thirty

1926
01:22:50,500 --> 01:22:53,300
noble characters
existing in the whole Game

1927
01:22:53,300 --> 01:22:54,300
of Thrones story.

1928
01:22:54,800 --> 01:22:56,211
Nonetheless, we have also

1929
01:22:56,211 --> 01:22:59,086
come across a few commoners
whose role in Game

1930
01:22:59,086 --> 01:23:01,700
of Thrones is found
to be exceptional. Let us

1931
01:23:01,700 --> 01:23:04,219
find out the details
of all those commoners

1932
01:23:04,219 --> 01:23:07,300
who were highly dedicated
to their roles in each episode.

1933
01:23:07,600 --> 01:23:08,700
The data of all

1934
01:23:08,700 --> 01:23:10,700
the commoners has
been successfully loaded

1935
01:23:10,700 --> 01:23:11,900
into the new data frame,

1936
01:23:11,900 --> 01:23:14,202
which is df4. Now let
us print the data

1937
01:23:14,202 --> 01:23:17,500
which is present in df4
using the show command.

1938
01:23:17,900 --> 01:23:20,396
So these are the top
20 characters identified as

1939
01:23:20,396 --> 01:23:23,004
commoners amongst all the Game
of Thrones stories.

1940
01:23:23,004 --> 01:23:25,400
Now, let us find out
the count of total number

1941
01:23:25,400 --> 01:23:26,600
of common characters.

1942
01:23:26,700 --> 01:23:27,649
So there are a total

1943
01:23:27,649 --> 01:23:30,099
of four hundred and
eighty seven common characters

1944
01:23:30,099 --> 01:23:32,000
amongst all stories
of Game of Thrones.

1945
01:23:32,000 --> 01:23:34,100
Let us continue
with the further operations.

1946
01:23:34,100 --> 01:23:35,700
Now there were a few roles

1947
01:23:35,700 --> 01:23:37,700
who were considered
to be important

1948
01:23:37,700 --> 01:23:39,210
and equally noble; hence,

1949
01:23:39,210 --> 01:23:41,526
they were carried on
until the last book.

1950
01:23:41,526 --> 01:23:43,644
So let us filter
out those characters

1951
01:23:43,644 --> 01:23:46,100
and find out the details
of each one of them.

1952
01:23:46,400 --> 01:23:49,520
The data of all the characters
who are considered to be noble

1953
01:23:49,520 --> 01:23:50,300
and carried on

1954
01:23:50,300 --> 01:23:53,300
until the last book has been
stored into the new data frame,

1955
01:23:53,300 --> 01:23:55,629
which is df4. Now let
us print the data

1956
01:23:55,629 --> 01:23:56,652
which exists

1957
01:23:56,652 --> 01:23:59,600
in the data frame df4. So,
according to the results,

1958
01:23:59,600 --> 01:24:00,650
we have two candidates

1959
01:24:00,650 --> 01:24:03,300
who are considered to be
noble, and their character

1960
01:24:03,300 --> 01:24:05,200
has been carried on
until the last book.

1961
01:24:05,700 --> 01:24:06,900
Amongst all the battles,

1962
01:24:06,900 --> 01:24:09,068
I found the battles
of the last books

1963
01:24:09,068 --> 01:24:11,900
to be generating more
adrenaline in the readers.

1964
01:24:11,900 --> 01:24:14,500
Let us find out the details
of those battles using

1965
01:24:14,500 --> 01:24:15,600
the following code.

1966
01:24:16,000 --> 01:24:18,700
So the following code will help
us to find out the wars

1967
01:24:18,700 --> 01:24:20,500
which were fought
in the last years

1968
01:24:20,500 --> 01:24:21,700
of Game of Thrones.

1969
01:24:22,100 --> 01:24:24,799
So these are the details
of the wars which were fought

1970
01:24:24,799 --> 01:24:26,800
in the last years
of Game of Thrones,

1971
01:24:26,800 --> 01:24:28,200
and the details of the Kings

1972
01:24:28,300 --> 01:24:30,067
and the details
of their commanders

1973
01:24:30,067 --> 01:24:32,200
and the location
where the war was fought.

1974
01:24:36,700 --> 01:24:40,579
Welcome to this interesting
session of the Spark SQL tutorial

1975
01:24:40,579 --> 01:24:41,600
from Edureka.

1976
01:24:41,600 --> 01:24:42,700
So in today's session,

1977
01:24:42,700 --> 01:24:46,100
we are going to learn about
how we will be working.

1978
01:24:46,100 --> 01:24:48,500
Spock sequent now what all you

1979
01:24:48,500 --> 01:24:51,944
can expect from this course
from this particular session

1980
01:24:51,944 --> 01:24:53,300
so you can expect that.

1981
01:24:53,300 --> 01:24:56,400
We will be first learning
by Sparks equal.

1982
01:24:56,500 --> 01:24:58,139
What are the libraries

1983
01:24:58,139 --> 01:25:00,600
which are present
in Sparks equal.

1984
01:25:00,600 --> 01:25:03,600
What are the important
features of Spark SQL?

1985
01:25:03,600 --> 01:25:06,400
We will also be doing
some hands-on examples,

1986
01:25:06,400 --> 01:25:10,323
and in the end we will see
some interesting use case

1987
01:25:10,323 --> 01:25:13,300
of stock market analysis. Now,

1988
01:25:13,400 --> 01:25:15,042
why Spark SQL? That is,

1989
01:25:15,042 --> 01:25:19,200
why are we learning it?
Why is it really important

1990
01:25:19,200 --> 01:25:22,067
for us to know about
this Spark SQL side?

1991
01:25:22,067 --> 01:25:24,200
Is it really hot in the market?

1992
01:25:24,200 --> 01:25:27,700
If yes, then why? We want
all those answers from this.

1993
01:25:27,700 --> 01:25:30,500
So if you're coming
from a Hadoop background,

1994
01:25:30,500 --> 01:25:34,102
you must have heard a lot
about Apache Hive. Now,

1995
01:25:34,300 --> 01:25:36,100
what happens in Apache.

1996
01:25:36,100 --> 01:25:39,061
Hive? So, like, in Apache
Hive, SQL developers

1997
01:25:39,061 --> 01:25:41,430
can write queries the SQL way,

1998
01:25:41,430 --> 01:25:43,800
and it will be getting converted

1999
01:25:43,800 --> 01:25:45,800
to MapReduce
and giving you the output.

2000
01:25:46,400 --> 01:25:47,600
Now we all know

2001
01:25:47,600 --> 01:25:50,000
that MapReduce is
slower in nature.

2002
01:25:50,000 --> 01:25:52,726
And since MapReduce
is going to be slower

2003
01:25:52,726 --> 01:25:54,500
in nature, then definitely

2004
01:25:54,500 --> 01:25:58,000
your overall Hive query
is going to be slower in nature.

2005
01:25:58,000 --> 01:25:59,537
So that was one challenge.

2006
01:25:59,537 --> 01:26:02,361
So if you have let's say
less than 200 GB of data

2007
01:26:02,361 --> 01:26:04,400
or if you have
a smaller set of data.

2008
01:26:04,400 --> 01:26:06,800
This was actually
a big challenge

2009
01:26:06,800 --> 01:26:10,400
that in Hive your performance
was not that great.

2010
01:26:10,400 --> 01:26:13,900
It also does not have
any resume capability if it gets stuck.

2011
01:26:13,900 --> 01:26:15,900
You can only start it again. Also,

2012
01:26:15,900 --> 01:26:19,200
Hive cannot even drop
your encrypted databases.

2013
01:26:19,200 --> 01:26:21,082
That was also one
of the challenges

2014
01:26:21,082 --> 01:26:23,200
when you deal with
the security side.

2015
01:26:23,200 --> 01:26:25,082
Now, what has Spark SQL done?

2016
01:26:25,082 --> 01:26:28,300
Spark SQL has solved
almost all of these problems.

2017
01:26:28,300 --> 01:26:31,064
So in the last sessions
you have already learned

2018
01:26:31,064 --> 01:26:34,500
about Spark, right? How
Spark is faster than MapReduce

2019
01:26:34,500 --> 01:26:36,200
and all. We have already learned

2020
01:26:36,200 --> 01:26:38,800
that in the previous
few sessions now.

2021
01:26:38,800 --> 01:26:39,917
So in this session,

2022
01:26:39,917 --> 01:26:43,000
we are going to kind of take
leverage of all that. So

2023
01:26:43,000 --> 01:26:44,800
definitely in this case

2024
01:26:44,800 --> 01:26:47,500
since Spark is
faster because of

2025
01:26:47,500 --> 01:26:49,200
the in-memory computation.

2026
01:26:49,200 --> 01:26:50,866
What is in-memory computation?

2027
01:26:50,866 --> 01:26:52,200
We have already seen it.

2028
01:26:52,200 --> 01:26:55,105
So in-memory computation
is whenever we

2029
01:26:55,105 --> 01:26:57,700
are computing anything
in memory directly.

2030
01:26:57,700 --> 01:27:01,165
So because of the in-memory
computation capability, because

2031
01:27:01,165 --> 01:27:02,800
of which Spark is faster,

2032
01:27:02,800 --> 01:27:07,500
definitely your Spark SQL is
also bound to become faster. Now,

2033
01:27:07,500 --> 01:27:08,600
so if I talk

2034
01:27:08,600 --> 01:27:11,900
about the advantages
of Spark SQL over Hive,

2035
01:27:11,900 --> 01:27:14,970
definitely number one it
is going to be faster

2036
01:27:14,970 --> 01:27:17,900
in comparison to your Hive.
So a Hive query,

2037
01:27:17,900 --> 01:27:20,900
which is, let's say,
taking around 10 minutes,

2038
01:27:20,900 --> 01:27:21,905
in Spark SQL

2039
01:27:21,905 --> 01:27:25,300
you can finish that same query
in less than one minute.

2040
01:27:25,300 --> 01:27:27,400
Don't you think it's
an awesome capability

2041
01:27:27,400 --> 01:27:31,400
of Spark SQL? Definitely yes,
right? Now, the second thing is,

2042
01:27:31,400 --> 01:27:34,400
let's say you
are writing something in Hive.

2043
01:27:34,400 --> 01:27:36,148
now you can take an example

2044
01:27:36,148 --> 01:27:39,751
of, let's say, a company
that has been developing Hive

2045
01:27:39,751 --> 01:27:41,467
queries for the last 10 years.

2046
01:27:41,467 --> 01:27:42,900
Now they were doing it.

2047
01:27:42,900 --> 01:27:44,000
They were all happy

2048
01:27:44,000 --> 01:27:46,000
that they were able
to process the data,

2049
01:27:46,100 --> 01:27:48,200
but they were worried
about the performance,

2050
01:27:48,200 --> 01:27:50,778
that Hive is not able
to give them that level

2051
01:27:50,778 --> 01:27:53,273
of processing speed that
they are looking for.

2052
01:27:53,273 --> 01:27:54,160
Now, this was a problem.

2053
01:27:54,160 --> 01:27:56,600
It's a challenge
for that particular company.

2054
01:27:56,600 --> 01:27:58,801
Now, there's a challenge right?

2055
01:27:58,801 --> 01:28:01,397
The challenge is,
they came to know

2056
01:28:01,397 --> 01:28:02,900
about Spark SQL. Fine.

2057
01:28:02,900 --> 01:28:04,685
Let's say we came
to know about it,

2058
01:28:04,685 --> 01:28:05,853
but they came to know

2059
01:28:05,853 --> 01:28:08,300
that we can execute
everything in Spark SQL

2060
01:28:08,300 --> 01:28:10,700
and it is going to be
faster as well. Fine.

2061
01:28:10,700 --> 01:28:12,281
But don't you think that

2062
01:28:12,281 --> 01:28:15,708
if this company has been working
for, let's say, the past 10 years

2063
01:28:15,708 --> 01:28:19,200
in Hive, they must have already
written a lot of code in Hive?

2064
01:28:19,200 --> 01:28:23,100
Now if you ask them to migrate
to Spark SQL, will it be

2065
01:28:23,100 --> 01:28:24,400
an easy task?

2066
01:28:24,400 --> 01:28:25,200
No, right.

2067
01:28:25,200 --> 01:28:25,982
Definitely.

2068
01:28:25,982 --> 01:28:28,384
It is not going
to be an easy task.

2069
01:28:28,384 --> 01:28:32,200
Why? Because Hive syntax
and Spark SQL syntax, though

2070
01:28:32,200 --> 01:28:35,800
they both take the SQL way
of writing things,

2071
01:28:35,800 --> 01:28:39,346
at the same time
there is always a very,

2072
01:28:39,346 --> 01:28:41,500
very big difference,

2073
01:28:41,500 --> 01:28:44,300
so there will be a good
difference whenever we talk

2074
01:28:44,300 --> 01:28:45,905
about the syntax between them.

2075
01:28:45,905 --> 01:28:48,100
So it will take a very
good amount of time

2076
01:28:48,100 --> 01:28:51,017
for that company to change
all of the queries over

2077
01:28:51,017 --> 01:28:54,052
to the Spark SQL way.
Now, Spark SQL came up

2078
01:28:54,052 --> 01:28:55,426
with a smart solution.

2079
01:28:55,426 --> 01:28:56,899
What they said is: even

2080
01:28:56,899 --> 01:28:58,900
if you are writing
the query with Hive,

2081
01:28:58,900 --> 01:29:01,300
you can execute
that Hive query directly

2082
01:29:01,300 --> 01:29:03,500
through Spark SQL. Don't you
think it's again

2083
01:29:03,500 --> 01:29:06,600
a very important
and awesome facility, right?

2084
01:29:06,600 --> 01:29:09,900
Because even now
if you're a good Hive developer,

2085
01:29:09,900 --> 01:29:12,000
you need not worry about

2086
01:29:12,000 --> 01:29:15,600
how you will now be
migrating to Spark SQL.

2087
01:29:15,600 --> 01:29:18,658
You can still keep on
writing the Hive query

2088
01:29:18,658 --> 01:29:20,900
and your query
will automatically be

2089
01:29:20,900 --> 01:29:24,767
getting converted to Spark SQL.
Similarly, in Apache Spark,

2090
01:29:24,767 --> 01:29:27,200
as we have learned
in the past sessions,

2091
01:29:27,200 --> 01:29:30,100
especially through Spark
Streaming, that Spark

2092
01:29:30,100 --> 01:29:33,600
Streaming is going to give
you real-time processing, right?

2093
01:29:33,600 --> 01:29:36,000
You can also perform
your real-time processing

2094
01:29:36,000 --> 01:29:37,615
using Apache Spark. Now,

2095
01:29:37,615 --> 01:29:39,500
this sort of facility you

2096
01:29:39,500 --> 01:29:41,800
can take leverage of even
in Spark SQL.

2097
01:29:41,800 --> 01:29:44,235
So let's say you can do
a real-time processing

2098
01:29:44,235 --> 01:29:46,400
and at the same time
you can also perform

2099
01:29:46,400 --> 01:29:47,860
your SQL query. Now, in Hive

2100
01:29:47,860 --> 01:29:49,120
that was the problem.

2101
01:29:49,120 --> 01:29:49,900
You cannot do

2102
01:29:49,900 --> 01:29:52,900
that, because when we talk
about Hive, now in Hive

2103
01:29:52,900 --> 01:29:54,320
it's all about Hadoop, which is

2104
01:29:54,320 --> 01:29:56,663
all about batch
processing: batch processing

2105
01:29:56,663 --> 01:29:58,509
where you keep historical data

2106
01:29:58,509 --> 01:30:00,736
and then later you
process it, right?

2107
01:30:00,736 --> 01:30:03,699
So definitely Hive also
follows the same approach.

2108
01:30:03,699 --> 01:30:05,300
In this case also, Hive is

2109
01:30:05,300 --> 01:30:07,850
going to just only follow
the batch processing mode,

2110
01:30:07,850 --> 01:30:09,600
but when it comes to Apache Spark,

2111
01:30:09,600 --> 01:30:13,500
it will also be taking care
of the real-time processing.

2112
01:30:13,500 --> 01:30:15,499
So how do all these things happen?

2113
01:30:15,499 --> 01:30:18,400
So Spark SQL always
uses the metastore

2114
01:30:18,400 --> 01:30:21,350
services of Hive
to query the data stored

2115
01:30:21,350 --> 01:30:22,400
and managed by Hive.

2116
01:30:22,400 --> 01:30:24,728
So when you were
learning about Hive,

2117
01:30:24,728 --> 01:30:28,123
we learned at that time
that in Hive, everything

2118
01:30:28,123 --> 01:30:30,711
we do is always
stored in the metastore.

2119
01:30:30,711 --> 01:30:33,491
So that metastore was
the crucial point, right?

2120
01:30:33,491 --> 01:30:35,200
Because using that metastore

2121
01:30:35,200 --> 01:30:37,600
only, you are able
to do everything.

2122
01:30:37,600 --> 01:30:41,100
So like when you are doing,
let's say, any sort of query,

2123
01:30:41,100 --> 01:30:42,707
when you're creating a table,

2124
01:30:42,707 --> 01:30:45,700
everything was getting stored
in that same metastore.

2125
01:30:45,700 --> 01:30:47,559
What happens is, Spark SQL

2126
01:30:47,559 --> 01:30:51,800
also uses the same metastore.
Now, whatever metastore

2127
01:30:51,800 --> 01:30:55,051
you have created with respect
to Hive, that same metastore

2128
01:30:55,051 --> 01:30:56,219
you can also use

2129
01:30:56,219 --> 01:30:58,900
for your Spark SQL,
and that is something

2130
01:30:58,900 --> 01:31:02,000
which is really awesome
about Spark SQL:

2131
01:31:02,000 --> 01:31:04,000
that you need not create
a new metastore.

2132
01:31:04,000 --> 01:31:06,300
You need not worry
about a new storage space

2133
01:31:06,300 --> 01:31:07,404
and all. Everything

2134
01:31:07,404 --> 01:31:10,820
you have done with respect
to your Hive, the same metastore

2135
01:31:10,820 --> 01:31:11,620
you can use it.

2136
01:31:11,620 --> 01:31:11,833
Now.

2137
01:31:11,833 --> 01:31:13,700
You can ask me then
how it is faster

2138
01:31:13,700 --> 01:31:15,700
if they're using
the same metastore? Remember:

2139
01:31:15,700 --> 01:31:18,500
the processing part is
why Hive was slower,

2140
01:31:18,500 --> 01:31:20,301
because of its processing way

2141
01:31:20,301 --> 01:31:23,519
because it is converting
everything to MapReduce,

2142
01:31:23,519 --> 01:31:26,782
and this was making
the processing very, very slow.

2143
01:31:26,782 --> 01:31:28,100
But here in this case

2144
01:31:28,100 --> 01:31:31,452
since the processing is going
to be in memory computation.

2145
01:31:31,452 --> 01:31:32,705
in the Spark SQL case

2146
01:31:32,705 --> 01:31:35,588
it is always going to be
faster. Now, definitely

2147
01:31:35,588 --> 01:31:37,545
it is just on
the metastore side that

2148
01:31:37,545 --> 01:31:39,600
we are only able
to fetch the data and all,

2149
01:31:39,600 --> 01:31:42,129
but at the same time,
for any other thing

2150
01:31:42,129 --> 01:31:44,100
of the processing related stuff,

2151
01:31:44,100 --> 01:31:46,200
it is always going to be At the

2152
01:31:46,200 --> 01:31:48,180
when we talk about
the processing stage

2153
01:31:48,180 --> 01:31:51,200
it is going to be in memory,
thus it's going to be faster.

2154
01:31:51,300 --> 01:31:54,335
So let's talk about some success
stories of Spark SQL.

2155
01:31:54,335 --> 01:31:57,550
Let's see some use cases:
Twitter sentiment analysis.

2156
01:31:57,550 --> 01:31:58,844
If you go through over

2157
01:31:58,844 --> 01:32:01,699
if you remember
our Spark Streaming session,

2158
01:32:01,700 --> 01:32:04,300
we have done a Twitter
sentiment analysis, right?

2159
01:32:04,300 --> 01:32:05,400
So there you have seen

2160
01:32:05,400 --> 01:32:08,497
that we have first initially
got the data from Twitter and

2161
01:32:08,497 --> 01:32:10,400
that too we got
with the help

2162
01:32:10,400 --> 01:32:11,911
of Spark Streaming, and later,

2163
01:32:11,911 --> 01:32:13,000
what did we do later?

2164
01:32:13,000 --> 01:32:15,600
We just analyzed everything
with the help of Spark

2165
01:32:15,600 --> 01:32:18,080
SQL. So you can see
an advantage of Spark SQL.

2166
01:32:18,080 --> 01:32:19,761
So in Twitter sentiment analysis

2167
01:32:19,761 --> 01:32:21,600
where let's say
you want to find out

2168
01:32:21,600 --> 01:32:23,200
about Donald Trump, right?

2169
01:32:23,200 --> 01:32:24,509
You are fetching the data,

2170
01:32:24,509 --> 01:32:26,547
every tweet related
to Donald Trump,

2171
01:32:26,547 --> 01:32:28,900
and then kind of doing
analysis and checking

2172
01:32:28,900 --> 01:32:31,200
whether it's
a positive tweet, negative

2173
01:32:31,200 --> 01:32:32,475
tweet, neutral tweet,

2174
01:32:32,475 --> 01:32:34,900
very negative tweet, or very
positive tweet.

2175
01:32:34,900 --> 01:32:37,257
Okay, so we have already
seen the same example there

2176
01:32:37,257 --> 01:32:38,607
in that particular session.

2177
01:32:38,607 --> 01:32:39,549
So in this session,

2178
01:32:39,549 --> 01:32:40,499
as you are noticing

2179
01:32:40,499 --> 01:32:42,600
what we are doing: we
just want to kind of show

2180
01:32:42,600 --> 01:32:44,202
that once you're
streaming the data

2181
01:32:44,202 --> 01:32:45,900
and the real time
you can also do it.

2182
01:32:45,900 --> 01:32:47,977
Also, seeing using
spark sequel just you

2183
01:32:47,977 --> 01:32:50,724
are doing all the processing
at the real time similarly

2184
01:32:50,724 --> 01:32:52,270
in the stock market analysis.

2185
01:32:52,270 --> 01:32:54,295
You can use Park
sequel lot of bullies.

2186
01:32:54,295 --> 01:32:57,400
You can adopt the in the banking
fraud case Transitions and all

2187
01:32:57,400 --> 01:32:58,400
you can use that.

2188
01:32:58,400 --> 01:33:01,000
So let's say your credit
card current is getting swipe

2189
01:33:01,000 --> 01:33:02,580
in India and in next 10 minutes

2190
01:33:02,580 --> 01:33:04,429
if your credit card
is getting swiped

2191
01:33:04,429 --> 01:33:05,456
in let's say in u.s.

2192
01:33:05,456 --> 01:33:07,100
Definitely that is not possible.

2193
01:33:07,100 --> 01:33:07,400
Right?

2194
01:33:07,400 --> 01:33:09,872
So let's say you are doing all
that processing in real time.

2195
01:33:09,872 --> 01:33:12,300
You're detecting everything
with respect to sparsely me.

2196
01:33:12,300 --> 01:33:15,400
Then you are let's say applying
your Sparks equal to verify

2197
01:33:15,400 --> 01:33:18,000
that Whether it's
a user Trend or not, right?

2198
01:33:18,000 --> 01:33:20,600
So all those things you want
to match up as possible.

2199
01:33:20,600 --> 01:33:21,960
So you can do that similarly

2200
01:33:21,960 --> 01:33:23,750
the medical domain
you can use that.

2201
01:33:23,750 --> 01:33:25,949
Let's talk about
some Spark SQL features.

2202
01:33:25,949 --> 01:33:28,200
So there will be
some features related to it.

2203
01:33:28,400 --> 01:33:30,200
Now, you can use

2204
01:33:30,200 --> 01:33:33,700
what happens: when SQL
got combined with Spark,

2205
01:33:33,700 --> 01:33:34,830
we started calling it

2206
01:33:34,830 --> 01:33:35,825
Spark SQL. Now,

2207
01:33:35,825 --> 01:33:38,700
when we are talking
about SQL, we are talking

2208
01:33:38,700 --> 01:33:40,405
about either structured data

2209
01:33:40,405 --> 01:33:41,800
or semi-structured data. Now,

2210
01:33:41,800 --> 01:33:44,231
SQL queries cannot deal
with unstructured data,

2211
01:33:44,231 --> 01:33:47,300
so that is definitely one
thing you need to keep in mind.

2212
01:33:47,300 --> 01:33:51,000
Now, Spark SQL also
supports various data formats.

2213
01:33:51,000 --> 01:33:52,800
You can get data from Parquet.

2214
01:33:52,800 --> 01:33:54,500
You must have heard about Parquet,

2215
01:33:54,500 --> 01:33:56,911
that it is a columnar
storage format and it

2216
01:33:56,911 --> 01:33:59,884
is a very much
compressed format of the data

2217
01:33:59,884 --> 01:34:02,300
that you have, but it's
not human readable.

2218
01:34:02,300 --> 01:34:02,800
Similarly.

2219
01:34:02,800 --> 01:34:04,800
You must have heard
about JSON, Avro,

2220
01:34:04,800 --> 01:34:07,200
where we keep the value
as a key-value pair.

2221
01:34:07,200 --> 01:34:08,482
Hive, Cassandra, right?

2222
01:34:08,482 --> 01:34:09,700
These are NoSQL DBs,

2223
01:34:09,700 --> 01:34:12,800
so you can get all the data
from these sources now.

2224
01:34:12,800 --> 01:34:15,114
You can also convert
your SQL queries

2225
01:34:15,114 --> 01:34:16,400
to the RDD way,

2226
01:34:16,400 --> 01:34:18,650
so you
will be able to perform

2227
01:34:18,650 --> 01:34:20,113
all the transformation steps.

2228
01:34:20,113 --> 01:34:21,800
So that is one thing you can do.

2229
01:34:21,800 --> 01:34:23,500
Now, if we talk about performance

2230
01:34:23,500 --> 01:34:26,700
and scalability, definitely
look at this red graph.

2231
01:34:26,700 --> 01:34:29,431
If you notice this
is related to your Hadoop,

2232
01:34:29,431 --> 01:34:30,300
you can notice

2233
01:34:30,300 --> 01:34:34,000
that the red graph is much
higher in comparison to the blue one,

2234
01:34:34,000 --> 01:34:36,617
and blue denotes
the performance with respect

2235
01:34:36,617 --> 01:34:37,503
to Spark SQL,

2236
01:34:37,503 --> 01:34:40,856
so you can notice that Spark
SQL is performing much better

2237
01:34:40,856 --> 01:34:42,684
in comparison to your Hadoop.

2238
01:34:42,684 --> 01:34:44,260
So on this y-axis

2239
01:34:44,260 --> 01:34:45,900
we are taking the running time.

2240
01:34:46,000 --> 01:34:47,200
On the x-axis

2241
01:34:47,200 --> 01:34:50,119
we are considering
the number of iterations

2242
01:34:50,119 --> 01:34:53,000
when we talk about
Spark SQL features.

2243
01:34:53,000 --> 01:34:56,000
Now, we have a few more
features. For example,

2244
01:34:56,000 --> 01:34:59,200
you can create a connection
simply with your JDBC driver

2245
01:34:59,200 --> 01:35:00,494
or ODBC driver, right?

2246
01:35:00,494 --> 01:35:02,482
These are simple
drivers that are present.

2247
01:35:02,482 --> 01:35:03,600
Now, you can create

2248
01:35:03,600 --> 01:35:06,700
your connection with Spark
SQL using all these drivers.

2249
01:35:06,700 --> 01:35:10,000
You can also create a user-defined
function. That means, let's say,

2250
01:35:10,000 --> 01:35:12,200
if any function is
not available to you

2251
01:35:12,200 --> 01:35:14,600
in that case you can create
your own functions.

2252
01:35:14,600 --> 01:35:16,900
Let's say, if a function
is available, use

2253
01:35:16,900 --> 01:35:18,639
that; if it is not available,

2254
01:35:18,639 --> 01:35:21,497
you can create a UDF, meaning
a user-defined function,

2255
01:35:21,497 --> 01:35:23,235
and you can directly execute

2256
01:35:23,235 --> 01:35:26,478
that user-defined function
and get your desired result.

2257
01:35:26,478 --> 01:35:28,900
So this is one example
where we have shown

2258
01:35:28,900 --> 01:35:30,100
that you can convert.

2259
01:35:30,100 --> 01:35:33,000
Let's say, if you don't have
an uppercase API present

2260
01:35:33,000 --> 01:35:36,405
in Spark SQL, how you
can create a simple UDF for it

2261
01:35:36,405 --> 01:35:37,700
and can execute it.

2262
01:35:37,700 --> 01:35:38,850
So if you notice there

2263
01:35:38,850 --> 01:35:41,200
what we are doing:
let's say this is my data.

2264
01:35:41,200 --> 01:35:42,700
So if you notice in this case,

2265
01:35:43,069 --> 01:35:45,530
this dataset is
my data part.

2266
01:35:45,800 --> 01:35:48,100
So this I'm generating
as a sequence.

2267
01:35:48,100 --> 01:35:51,800
I'm creating it as a DataFrame;
see this toDF part here.

2268
01:35:51,800 --> 01:35:55,100
Now, after that, we
are creating an upper UDF here,

2269
01:35:55,100 --> 01:35:58,217
and notice we are converting
any value which is coming in

2270
01:35:58,217 --> 01:35:59,600
to upper case, right?

2271
01:35:59,600 --> 01:36:02,000
We are using this toUpperCase
API to convert it.

2272
01:36:02,100 --> 01:36:05,800
We are importing this function
and then what we did now

2273
01:36:05,800 --> 01:36:08,100
when we came here,
we are telling that okay.

2274
01:36:08,100 --> 01:36:09,236
This is my UDF.

2275
01:36:09,236 --> 01:36:10,600
So the UDF is upper,

2276
01:36:10,600 --> 01:36:12,719
because we have created
it here also as upper.

2277
01:36:12,719 --> 01:36:13,569
So we are telling

2278
01:36:13,569 --> 01:36:16,100
that this is my UDF
in the first step, and then,

2279
01:36:16,100 --> 01:36:17,153
when we are using it,

2280
01:36:17,153 --> 01:36:20,253
let's say with our datasets
what we are doing so data sets.

2281
01:36:20,253 --> 01:36:22,100
We are passing year
that okay, whatever.

2282
01:36:22,100 --> 01:36:23,393
We are doing convert it

2283
01:36:23,393 --> 01:36:26,600
to my upper developer you DFX
convert it to my upper case.

2284
01:36:26,600 --> 01:36:29,100
So see, we are telling you
we have created our upper UDF;

2285
01:36:29,100 --> 01:36:31,500
that is what we are passing
inside this text value.

2286
01:36:31,800 --> 01:36:34,600
So now it is just
getting converted

2287
01:36:34,600 --> 01:36:37,600
and giving you all the output
in upper case,

2288
01:36:37,600 --> 01:36:40,400
so you can notice
that this is your last value

2289
01:36:40,400 --> 01:36:42,700
and this is your
uppercase value, right?

2290
01:36:42,700 --> 01:36:43,841
So this got converted

2291
01:36:43,841 --> 01:36:45,900
to my upper case
in this particular.

2292
01:36:45,900 --> 01:36:46,500
Love it.

2293
01:36:46,500 --> 01:36:46,900
Now.

2294
01:36:46,900 --> 01:36:49,123
If you notice here
also same steps.

2295
01:36:49,123 --> 01:36:52,000
This is how we
can register all of our UDFs.

2296
01:36:52,000 --> 01:36:53,620
This is not being shown here.

2297
01:36:53,620 --> 01:36:55,800
So now this is
how you can do that spark

2298
01:36:55,800 --> 01:36:57,354
that UDF not register.

2299
01:36:57,354 --> 01:36:58,574
So using this API,

2300
01:36:58,574 --> 01:37:02,100
you can just register
your UDFs. Now, similarly,

2301
01:37:02,100 --> 01:37:03,870
if you want to get the output

2302
01:37:03,870 --> 01:37:06,800
after that, you can get
it in the following way:

2303
01:37:06,800 --> 01:37:09,900
so you can use the show API
to get the output

2304
01:37:09,900 --> 01:37:12,100
for this. Spark
SQL architecture:

2305
01:37:12,100 --> 01:37:13,800
let's see that. So what is Spark

2306
01:37:13,800 --> 01:37:16,400
SQL architecture? Now, in
Spark SQL architecture,

2307
01:37:16,400 --> 01:37:18,100
if we talk about it,
what happens is you are, let

2308
01:37:18,100 --> 01:37:19,900
's say, getting the data
using

2309
01:37:19,900 --> 01:37:21,500
your various formats, right?

2310
01:37:21,500 --> 01:37:23,911
So let's say you can get
it from your CSV.

2311
01:37:23,911 --> 01:37:26,056
You can get it
from your JSON format.

2312
01:37:26,056 --> 01:37:28,475
You can also get it
from your JDBC format.

2313
01:37:28,475 --> 01:37:30,400
Now, there will be
a Data Source API.

2314
01:37:30,400 --> 01:37:31,708
So using the Data Source API,

2315
01:37:31,708 --> 01:37:34,273
you can fetch the data.
After fetching the data,

2316
01:37:34,273 --> 01:37:36,300
you will be converting it
to a DataFrame.

2317
01:37:36,300 --> 01:37:38,000
So what is a DataFrame?

2318
01:37:38,000 --> 01:37:39,833
So in the last session
we learned

2319
01:37:39,833 --> 01:37:42,892
that when we were creating
everything as RDDs,

2320
01:37:42,892 --> 01:37:43,900
what were we doing?

2321
01:37:43,900 --> 01:37:46,437
So, let's say this was
my cluster, right?

2322
01:37:46,437 --> 01:37:48,358
So let's say this is a machine.

2323
01:37:48,358 --> 01:37:49,860
This is another machine.

2324
01:37:49,860 --> 01:37:51,800
This is another machine, right?

2325
01:37:51,800 --> 01:37:53,757
So let's say these together are
my cluster.

2326
01:37:53,757 --> 01:37:55,703
So what we were doing
in this case now

2327
01:37:55,703 --> 01:37:58,700
when we were creating all
these things as RDDs on the cluster,

2328
01:37:58,700 --> 01:38:00,000
what was happening here?

2329
01:38:00,000 --> 01:38:02,600
We were passing
all of our values in, right?

2330
01:38:02,600 --> 01:38:04,739
So let's say we
were keeping all the data.

2331
01:38:04,739 --> 01:38:06,200
Let's say block B1 was there

2332
01:38:06,200 --> 01:38:08,850
so we were passing all
the values and were creating it

2333
01:38:08,850 --> 01:38:11,400
in memory,
and we were calling

2334
01:38:11,400 --> 01:38:12,800
that an RDD. Now,

2335
01:38:12,800 --> 01:38:16,094
when we are working in SQL,
we have to store the data,

2336
01:38:16,094 --> 01:38:17,900
which is a table of data, right?

2337
01:38:17,900 --> 01:38:19,200
So let's say there is a table,

2338
01:38:19,200 --> 01:38:21,200
which is, let's say,
having column details,

2339
01:38:21,200 --> 01:38:23,200
let's say name, age.

2340
01:38:23,200 --> 01:38:24,024
Let's say here.

2341
01:38:24,024 --> 01:38:26,236
I have some value here,
some value here.

2342
01:38:26,236 --> 01:38:28,506
I have some value here,
some value, right?

2343
01:38:28,506 --> 01:38:31,200
So let's say I have some values
in this table format.

2344
01:38:31,200 --> 01:38:34,200
Now, if I have to keep
this data in my cluster,

2345
01:38:34,200 --> 01:38:35,200
what do you need to do?

2346
01:38:35,200 --> 01:38:37,962
You will be keeping it, first
of all, in memory.

2347
01:38:37,962 --> 01:38:39,100
So you will be having

2348
01:38:39,100 --> 01:38:42,418
let's say, name, age, these column
details first of all here,

2349
01:38:42,418 --> 01:38:45,767
and after that you will be
having some details of this.

2350
01:38:45,767 --> 01:38:46,210
Perfect.

2351
01:38:46,210 --> 01:38:47,804
So let's say this much data,

2352
01:38:47,804 --> 01:38:49,900
you have here. Some part
of a similar kind

2353
01:38:49,900 --> 01:38:52,572
of table with some other values
will be here also,

2354
01:38:52,572 --> 01:38:55,300
but here also you are going
to have column details.

2355
01:38:55,300 --> 01:38:58,500
You will be having name, age,
and some more data here.

2356
01:38:58,600 --> 01:39:02,600
Now, if you notice, this
is sounding similar to an RDD,

2357
01:39:02,700 --> 01:39:06,000
but this is not exactly
like an RDD, right?

2358
01:39:06,000 --> 01:39:09,400
Because here we are not only
keeping just the data, but we

2359
01:39:09,400 --> 01:39:12,500
are also storing something
like a column in storage,

2360
01:39:12,500 --> 01:39:12,861
right?

2361
01:39:12,861 --> 01:39:15,400
We are also keeping
the columns in all of the

2362
01:39:15,400 --> 01:39:18,500
data nodes, or we can call them
worker nodes, right?

2363
01:39:18,500 --> 01:39:20,653
So we are also keeping
the column details

2364
01:39:20,653 --> 01:39:22,000
along with the row details.

2365
01:39:22,000 --> 01:39:24,700
So this thing is called
a DataFrame.

2366
01:39:24,700 --> 01:39:26,600
Okay, so that is called
your DataFrame.

2367
01:39:26,600 --> 01:39:29,400
So what we are going to
do is, we are going to convert it

2368
01:39:29,400 --> 01:39:31,057
to a DataFrame. Then,

2369
01:39:31,057 --> 01:39:35,200
using the DataFrame DSL or by
using Spark SQL or HQL,

2370
01:39:35,200 --> 01:39:37,550
you will be processing
the results and giving

2371
01:39:37,550 --> 01:39:40,300
the output. We will learn about
all these things in detail.

2372
01:39:40,600 --> 01:39:44,100
So, let's see the Spark SQL
libraries. Now, there are

2373
01:39:44,100 --> 01:39:45,800
multiple APIs available.

2374
01:39:45,800 --> 01:39:48,700
Like, we have
the Data Source API, we

2375
01:39:48,700 --> 01:39:50,500
have the DataFrame API.

2376
01:39:50,500 --> 01:39:53,510
We have the interpreter
and optimizer, and the SQL service.

2377
01:39:53,510 --> 01:39:55,600
We will explore
all this in detail.

2378
01:39:55,600 --> 01:39:58,000
So let's talk about
the Data Source API.

2379
01:39:58,000 --> 01:40:02,787
If we talk about the Data Source API,
what happens in the Data Source API?

2380
01:40:02,787 --> 01:40:04,133
It is used to read

2381
01:40:04,133 --> 01:40:07,364
and store structured
and semi-structured data

2382
01:40:07,364 --> 01:40:08,800
in your Spark SQL.

2383
01:40:08,800 --> 01:40:12,200
So as you can notice, in Spark
SQL we can fetch the data

2384
01:40:12,200 --> 01:40:13,437
using multiple sources,

2385
01:40:13,437 --> 01:40:15,800
like you can get it
from Hive, Cassandra,

2386
01:40:15,800 --> 01:40:18,800
CSV, Apache
HBase, Oracle DB, so

2387
01:40:18,800 --> 01:40:20,300
many formats available, right?

2388
01:40:20,300 --> 01:40:21,427
So this API is going

2389
01:40:21,427 --> 01:40:24,956
to help you to get all the data
to read all the data store it

2390
01:40:24,956 --> 01:40:26,700
where ever you want to use it.

2391
01:40:26,700 --> 01:40:28,387
Now after that your data

2392
01:40:28,387 --> 01:40:31,200
frame API is going
to help you to convert

2393
01:40:31,200 --> 01:40:33,100
that into named columns,

2394
01:40:33,100 --> 01:40:34,700
and remember I
just explained you

2395
01:40:34,800 --> 01:40:36,902
that how you store
the data in that

2396
01:40:36,902 --> 01:40:39,793
because here you are not keeping
it like an RDD.

2397
01:40:39,793 --> 01:40:42,100
You're also keeping
the named column as

2398
01:40:42,100 --> 01:40:45,500
well as the rows. That is
the difference coming up here.

2399
01:40:45,500 --> 01:40:47,382
So that is
what it is converting.

2400
01:40:47,382 --> 01:40:48,100
In this case.

2401
01:40:48,100 --> 01:40:50,561
We are using data
frame API to convert it

2402
01:40:50,561 --> 01:40:52,900
into your named column
and rows, right?

2403
01:40:52,900 --> 01:40:54,600
So that is what you
will be doing.

2404
01:40:54,600 --> 01:40:57,700
So it also follows the same
properties as your RDDs,

2405
01:40:57,700 --> 01:40:59,993
like your RDD is
lazily evaluated;

2406
01:40:59,993 --> 01:41:02,500
all the same properties
will also apply here.

2407
01:41:02,500 --> 01:41:06,000
Okay, now the interpreter
and optimizer. In the interpreter

2408
01:41:06,000 --> 01:41:08,485
and optimizer step,
what are we going to do?

2409
01:41:08,485 --> 01:41:11,184
So, let's see if we have
this data frame API,

2410
01:41:11,184 --> 01:41:13,700
so we are going to first
create this named

2411
01:41:13,700 --> 01:41:17,800
column. Then, after that, we
will be creating an RDD.

2412
01:41:17,800 --> 01:41:20,400
We will be applying
our transformation step.

2413
01:41:20,400 --> 01:41:23,877
We will be doing our action
step, right, to output the value.

2414
01:41:23,877 --> 01:41:25,040
So all those things

2415
01:41:25,040 --> 01:41:28,100
wherever they happen, are happening
in the interpreter

2416
01:41:28,100 --> 01:41:29,400
and optimizer step.

2417
01:41:29,400 --> 01:41:33,500
So this is all happening
in the interpreter and optimizer.

2418
01:41:33,600 --> 01:41:36,000
So this is what all
the features you have.

2419
01:41:36,000 --> 01:41:39,500
Now, let's talk about
SQL service now in SQL service

2420
01:41:39,500 --> 01:41:41,934
what happens it is going
to again help you

2421
01:41:41,934 --> 01:41:43,698
so it is just doing the transformation

2422
01:41:43,698 --> 01:41:45,200
and action. In the last step,

2423
01:41:45,200 --> 01:41:47,567
after that using
spark SQL service,

2424
01:41:47,567 --> 01:41:50,700
you will be getting
your Spark SQL outputs.

2425
01:41:50,700 --> 01:41:54,200
So now in this case whatever
processing you have done right

2426
01:41:54,200 --> 01:41:57,500
in terms of transformations
in all of that so you can see

2427
01:41:57,500 --> 01:42:01,600
that your Spark SQL service
is an entry point for working

2428
01:42:01,600 --> 01:42:04,486
with structured data
in Apache Spark.

2429
01:42:04,486 --> 01:42:04,800
Okay.

2430
01:42:04,800 --> 01:42:07,611
So it is going to kind of
help you to fetch the results

2431
01:42:07,611 --> 01:42:08,700
from your optimized data

2432
01:42:08,700 --> 01:42:10,900
or maybe whatever you
have interpreted before

2433
01:42:10,900 --> 01:42:12,100
so that is what it's doing.

2434
01:42:12,100 --> 01:42:13,400
So this kind of completes.

2435
01:42:13,500 --> 01:42:15,400
This whole diagram now,

2436
01:42:15,400 --> 01:42:18,082
let us see how we
can perform our queries

2437
01:42:18,082 --> 01:42:19,200
using Spark SQL.

2438
01:42:19,200 --> 01:42:21,435
Now if we talk
about spark SQL queries,

2439
01:42:21,435 --> 01:42:22,376
so first of all,

2440
01:42:22,376 --> 01:42:25,348
we can go to the Spark shell itself
and execute everything.

2441
01:42:25,348 --> 01:42:27,253
You can also execute
your program using

2442
01:42:27,253 --> 01:42:29,500
Spark in Eclipse, also
directly from there.

2443
01:42:29,500 --> 01:42:30,600
Also, you can do that.

2444
01:42:30,600 --> 01:42:33,249
So, let's say you are logged in
to your Spark shell session.

2445
01:42:33,249 --> 01:42:34,200
So what you can do,

2446
01:42:34,200 --> 01:42:36,700
so let's say you have first
you need to import this

2447
01:42:36,700 --> 01:42:38,464
because in 2.x
you must have heard

2448
01:42:38,464 --> 01:42:40,500
that there is something
called SparkSession

2449
01:42:40,500 --> 01:42:42,197
which came so that is
what we are doing.

2450
01:42:42,197 --> 01:42:44,200
So in our last session
we have already learned

2451
01:42:44,200 --> 01:42:47,077
about all these things.
Now SparkSession is something

2452
01:42:47,077 --> 01:42:48,700
that we're importing. After that,

2453
01:42:48,700 --> 01:42:51,940
we are creating the session, spark,
using a builder function.

2454
01:42:51,940 --> 01:42:52,704
Look at this.

2455
01:42:52,704 --> 01:42:55,822
So this builder API, we
are using this Builder API,

2456
01:42:55,822 --> 01:42:57,458
then we are using the app name.

2457
01:42:57,458 --> 01:43:00,256
We are providing a configuration
and then we are telling

2458
01:43:00,256 --> 01:43:02,860
that we are going to create
our values here, right?

2459
01:43:02,860 --> 01:43:05,100
So that's why
we are giving getOrCreate. Okay,

2460
01:43:05,100 --> 01:43:07,987
then we are importing
all these things right

2461
01:43:07,987 --> 01:43:09,800
once we imported after that

2462
01:43:09,800 --> 01:43:10,900
we can say that okay.

2463
01:43:10,900 --> 01:43:12,731
We want to read
this JSON file.

2464
01:43:12,731 --> 01:43:15,400
So this employee.json
we want to read here,

2465
01:43:15,400 --> 01:43:18,398
and in the end we want
to Output this value, right?

2466
01:43:18,398 --> 01:43:21,700
So this df becomes my data
frame, containing the stored values

2467
01:43:21,700 --> 01:43:23,188
of my employee.json.

2468
01:43:23,188 --> 01:43:25,655
So this JSON value
will get converted

2469
01:43:25,655 --> 01:43:26,710
to my data frame.

2470
01:43:26,710 --> 01:43:30,000
Now, in the end, we are just
outputting the result. Now,

2471
01:43:30,000 --> 01:43:32,100
if you notice here
what we are doing,

2472
01:43:32,100 --> 01:43:33,312
so here we are first

2473
01:43:33,312 --> 01:43:36,100
of all importing SparkSession.
Same story.

2474
01:43:36,100 --> 01:43:37,200
We are just executing it.

2475
01:43:37,200 --> 01:43:39,500
Then we are building
our things better in that.

2476
01:43:39,500 --> 01:43:41,000
We're going to
create that again.

2477
01:43:41,000 --> 01:43:44,243
We are importing it then
we are reading Json file

2478
01:43:44,243 --> 01:43:46,000
by using the read.json API.

2479
01:43:46,000 --> 01:43:47,900
We are reading
our employee.json,

2480
01:43:47,900 --> 01:43:50,428
Okay, which is present
in this particular directory

2481
01:43:50,428 --> 01:43:52,400
and we are outputting it.
So you can see

2482
01:43:52,400 --> 01:43:55,300
that the JSON format will be
the key-value format.

2483
01:43:55,300 --> 01:43:59,200
But when I'm doing this df.show,
it is just showing

2484
01:43:59,200 --> 01:44:00,700
up all my values here.
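A minimal sketch of the steps just described, meant to be pasted into a Spark shell. The app name, the local master, and the two inline JSON records are assumptions for illustration; the session itself reads an employee.json file from disk, and the inline records only stand in for it.
```scala
import org.apache.spark.sql.SparkSession
// Assumption: a local master and this app name, so the sketch also runs outside a cluster.
val spark = SparkSession.builder().appName("Spark SQL sketch").master("local[*]").getOrCreate()
import spark.implicits._ // brings in toDS/toDF and the common encoders
// Hypothetical sample records standing in for the employee.json file used in the session.
val jsonLines = Seq("""{"name":"John","age":28}""", """{"name":"Andrew","age":36}""").toDS()
// read.json infers the schema from the records themselves.
val df = spark.read.json(jsonLines)
df.show() // prints the data frame as a table of named columns and rows
```
To read a real file instead, spark.read.json also accepts a path string.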

2485
01:44:00,700 --> 01:44:00,935
Now.

2486
01:44:00,935 --> 01:44:03,138
Let's see how we
can create our data set.

2487
01:44:03,138 --> 01:44:04,900
Now when we talk about data set,

2488
01:44:04,900 --> 01:44:06,500
you can notice
what we're doing.

2489
01:44:06,500 --> 01:44:06,700
Now.

2490
01:44:06,700 --> 01:44:09,200
We have understood all
this till now, so how we

2491
01:44:09,200 --> 01:44:12,300
can create a data set now
first of all in data set

2492
01:44:12,300 --> 01:44:14,800
what we do so So
in data set we can create

2493
01:44:14,800 --> 01:44:17,900
the class. You can see we
are creating a case class Employee,

2494
01:44:17,900 --> 01:44:19,600
right now in case class

2495
01:44:19,600 --> 01:44:22,400
what we are doing: we are
just creating a sequence

2496
01:44:22,400 --> 01:44:25,600
and putting in the values, Andrew and an age,
for the name and age columns.

2497
01:44:25,600 --> 01:44:28,076
Then we are displaying
our output of this data

2498
01:44:28,076 --> 01:44:28,803
set, right? Now,

2499
01:44:28,803 --> 01:44:32,010
We are creating a primitive data
set also to demonstrate mapping

2500
01:44:32,010 --> 01:44:33,894
of this data frames
to your data sets.

2501
01:44:33,894 --> 01:44:34,200
Right?

2502
01:44:34,200 --> 01:44:36,200
So you can notice
that we are using

2503
01:44:36,200 --> 01:44:37,700
toDS instead of toDF.

2504
01:44:37,700 --> 01:44:39,500
We are using toDS
in this case.
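As a hedged sketch of the data set creation just described, for a Spark shell: the case class fields follow the name and age columns from the session, while the age value 55, the app name, and the local master are assumptions for illustration.
```scala
import org.apache.spark.sql.SparkSession
// Assumption: local master and app name, so the sketch is self-contained.
val spark = SparkSession.builder().appName("Dataset sketch").master("local[*]").getOrCreate()
import spark.implicits._ // provides toDS and the encoders for case classes and primitives
// Case class with the name and age columns; the sample values are illustrative.
case class Employee(name: String, age: Long)
val caseClassDS = Seq(Employee("Andrew", 55)).toDS()
caseClassDS.show()
// A data set of primitives, transformed with a typed map.
val primitiveDS = Seq(1, 2, 3).toDS()
val plusOne = primitiveDS.map(_ + 1).collect()
```
Here toDS keeps the Employee type on each row, whereas toDF would give untyped rows.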

2505
01:44:39,500 --> 01:44:42,293
Now, you may ask me what's
the difference with respect

2506
01:44:42,293 --> 01:44:43,400
to data frame, right?

2507
01:44:43,400 --> 01:44:45,100
With respect to data frame

2508
01:44:45,100 --> 01:44:46,700
in data frame
what we were doing.

2509
01:44:46,700 --> 01:44:48,682
Well, again,
the data frame

2510
01:44:48,682 --> 01:44:50,800
and data set both
look exactly the same.

2511
01:44:50,800 --> 01:44:53,228
It will also be having
the named columns and rows

2512
01:44:53,228 --> 01:44:54,200
and everything up.

2513
01:44:54,200 --> 01:44:57,334
It was introduced later,
in version 1.6 and onwards.

2514
01:44:57,334 --> 01:44:58,196
And what does it

2515
01:44:58,196 --> 01:45:01,100
provide? It provides
an encoder mechanism, using

2516
01:45:01,100 --> 01:45:02,000
which you can get

2517
01:45:02,000 --> 01:45:04,208
when you are, let's say,
reading the data back.

2518
01:45:04,208 --> 01:45:06,200
Let's say you are deserializing;
you're not doing

2519
01:45:06,200 --> 01:45:06,968
that step, right?

2520
01:45:06,968 --> 01:45:08,300
It is going to be faster.

2521
01:45:08,300 --> 01:45:10,400
So the performance
wise data set is better.

2522
01:45:10,400 --> 01:45:13,000
That's the reason it
was introduced later. Nowadays,

2523
01:45:13,000 --> 01:45:15,794
people are moving from
data frames to data sets. Okay.

2524
01:45:15,794 --> 01:45:17,500
So now we are just outputting

2525
01:45:17,500 --> 01:45:19,703
in the end see the same
thing in the output.

2526
01:45:19,703 --> 01:45:21,623
So we are creating
the Employee class.

2527
01:45:21,623 --> 01:45:24,684
Then we are putting the value
inside it creating a data set.

2528
01:45:24,684 --> 01:45:26,500
We are looking
at the values, right?

2529
01:45:26,500 --> 01:45:29,200
So these are the steps we
have just understood them now

2530
01:45:29,200 --> 01:45:32,000
how we can read a file. So
we want to read the file.

2531
01:45:32,000 --> 01:45:35,300
So we will use read.json
as Employee. Employee was,

2532
01:45:35,300 --> 01:45:38,026
remember, the case class which
we created last time.

2533
01:45:38,026 --> 01:45:39,700
This was the class
we created,

2534
01:45:39,700 --> 01:45:40,900
your case class employee.

2535
01:45:40,900 --> 01:45:43,300
So we are telling
that we are creating like this.

2536
01:45:43,500 --> 01:45:45,200
We are just
outputting this value.

2537
01:45:45,200 --> 01:45:47,612
We just do it with show;
you can see, this way

2538
01:45:47,612 --> 01:45:49,000
We can see this output.

2539
01:45:49,000 --> 01:45:50,700
Also now, let's see

2540
01:45:50,700 --> 01:45:53,900
how we can add the schema
to an RDD. Now, in order

2541
01:45:53,900 --> 01:45:57,300
to add the schema to rdd
what we are going to do.

2542
01:45:57,300 --> 01:45:59,100
So in this case also,

2543
01:45:59,200 --> 01:46:01,500
you can look at we
are importing all the values

2544
01:46:01,500 --> 01:46:03,700
that we are importing all
the libraries whatever

2545
01:46:03,700 --> 01:46:04,779
are required then

2546
01:46:04,779 --> 01:46:07,622
after that we are using
this Spark context textFile,

2547
01:46:07,622 --> 01:46:09,600
by reading the data splitting it

2548
01:46:09,600 --> 01:46:12,400
with respect to comma then
mapping the attributes.

2549
01:46:12,400 --> 01:46:14,750
with the Employee case class,
that's what we have done,

2550
01:46:14,750 --> 01:46:17,041
and converting
these values to integer.

2551
01:46:17,041 --> 01:46:19,891
So, in the end, we are converting
it with toDF, right? After that,

2552
01:46:19,891 --> 01:46:22,378
We are going to create
a temporary view or table.

2553
01:46:22,378 --> 01:46:24,600
So let's create
this temporary view, employee.

2554
01:46:24,600 --> 01:46:26,800
Then we are going
to use spark.sql

2555
01:46:26,800 --> 01:46:28,570
and passing up our SQL query.

2556
01:46:28,570 --> 01:46:31,500
Can you notice that we
have now passing the value

2557
01:46:31,500 --> 01:46:33,900
and we are accessing
this employee, right?

2558
01:46:33,900 --> 01:46:36,000
We are accessing
this employee here.

2559
01:46:36,000 --> 01:46:38,500
Now, what is this employee?
This employee was

2560
01:46:38,500 --> 01:46:40,500
a temporary view
which we have created

2561
01:46:40,500 --> 01:46:43,128
because the challenge
in Spark SQL is,

2562
01:46:43,128 --> 01:46:46,329
whenever you want
to execute any SQL query you

2563
01:46:46,329 --> 01:46:49,400
cannot say select *
from the data frame.

2564
01:46:49,400 --> 01:46:50,439
You cannot do that.

2565
01:46:50,439 --> 01:46:52,300
There's this is
not even supported.

2566
01:46:52,300 --> 01:46:55,547
So you cannot do select *
from your data frame.

2567
01:46:55,547 --> 01:46:56,508
So instead of that

2568
01:46:56,508 --> 01:46:59,500
what we need to do is we need
to create a temporary table

2569
01:46:59,500 --> 01:47:01,732
or a temporary view
so you can notice here.

2570
01:47:01,732 --> 01:47:04,456
We are using this
createOrReplaceTempView. Why replace?

2571
01:47:04,456 --> 01:47:07,349
Because if it already
exists, it overwrites it.

2572
01:47:07,349 --> 01:47:09,400
So now we are creating
a temporary table

2573
01:47:09,400 --> 01:47:12,900
which will be exactly similar
to this data frame of mine. Now

2574
01:47:12,900 --> 01:47:15,605
you can just directly
execute all the queries

2575
01:47:15,605 --> 01:47:18,100
on your temporary view
or temporary table.

2576
01:47:18,100 --> 01:47:21,258
So you can notice here
instead of using employeeDF,

2577
01:47:21,258 --> 01:47:22,800
which was our data frame.

2578
01:47:22,800 --> 01:47:24,730
I am using here temporary view.

2579
01:47:24,730 --> 01:47:26,100
Okay, then in the end,

2580
01:47:26,100 --> 01:47:28,000
we are just mapping
the names and ages, right,

2581
01:47:28,000 --> 01:47:29,669
and we are outputting the values.
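The flow above (text lines to case class, toDF, temporary view, then SQL) can be sketched as follows for a Spark shell. The three sample lines are hypothetical and stand in for the text file read in the session; the app name, local master, and the 18 to 30 age filter are likewise assumptions for illustration.
```scala
import org.apache.spark.sql.SparkSession
// Assumption: local master and app name, so the sketch is self-contained.
val spark = SparkSession.builder().appName("Schema by reflection sketch").master("local[*]").getOrCreate()
import spark.implicits._ // provides toDF for RDDs of case classes
case class Employee(name: String, age: Int)
// Hypothetical "name, age" lines standing in for the text file from the session.
val lines = spark.sparkContext.parallelize(Seq("John, 28", "Andrew, 36", "Clarke, 22"))
// Split on the comma, trim the age, and map each line to an Employee.
val employeeDF = lines.map(_.split(",")).map(a => Employee(a(0), a(1).trim.toInt)).toDF()
// SQL runs against the view name, not against the employeeDF variable.
employeeDF.createOrReplaceTempView("employee")
val youngstersDF = spark.sql("SELECT name, age FROM employee WHERE age BETWEEN 18 AND 30")
youngstersDF.show()
```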

2582
01:47:29,669 --> 01:47:30,200
That's it.

2583
01:47:30,200 --> 01:47:31,000
Same thing.

2584
01:47:31,000 --> 01:47:33,300
This is just
an execution part of it.

2585
01:47:33,300 --> 01:47:35,350
So we are just showing
all the steps here.

2586
01:47:35,350 --> 01:47:36,500
You can see in the end.

2587
01:47:36,500 --> 01:47:38,500
We are outputting
all this value now

2588
01:47:38,600 --> 01:47:40,800
how we can add
the schema to rdd.

2589
01:47:40,800 --> 01:47:43,850
Let's see this transformation
step now in this case you Notice

2590
01:47:43,850 --> 01:47:45,404
that we can map
this youngstersDF;

2591
01:47:45,404 --> 01:47:46,900
we're converting
the mapped name

2592
01:47:46,900 --> 01:47:49,211
into a string for
the transformation part, right?

2593
01:47:49,211 --> 01:47:51,200
So we are checking all
this value that okay.

2594
01:47:51,200 --> 01:47:53,500
This is the string type name.

2595
01:47:53,500 --> 01:47:55,900
We are just showing up
this value right now.

2596
01:47:55,900 --> 01:47:56,900
What were you doing?

2597
01:47:56,900 --> 01:48:00,400
We are using this map encoder
from the implicits class,

2598
01:48:00,400 --> 01:48:03,717
which is available to us
to map the name and age.

2599
01:48:03,717 --> 01:48:04,000
Okay.

2600
01:48:04,000 --> 01:48:05,529
So this is
what we're going to do

2601
01:48:05,529 --> 01:48:07,579
because remember in
the Employee class,

2602
01:48:07,579 --> 01:48:10,400
We have the name and age column
that we want to map now.

2603
01:48:10,400 --> 01:48:11,272
Now in this case,

2604
01:48:11,272 --> 01:48:13,164
we are mapping
the names to the ages.

2605
01:48:13,164 --> 01:48:14,400
So you can notice

2606
01:48:14,400 --> 01:48:17,600
that we are doing it for the ages
of our youngstersDF data frame

2607
01:48:17,600 --> 01:48:19,335
that what we
have created earlier

2608
01:48:19,335 --> 01:48:20,800
and the result is an array.

2609
01:48:20,800 --> 01:48:23,400
So the result that you're going
to get will be an array

2610
01:48:23,400 --> 01:48:25,700
with the names mapped
to their respective ages.

2611
01:48:25,700 --> 01:48:27,800
You can see this output
here so you can see

2612
01:48:27,800 --> 01:48:29,100
that this is getting mapped.

2613
01:48:29,100 --> 01:48:29,426
Right.

2614
01:48:29,426 --> 01:48:32,201
So we are seeing
this output like name is John

2615
01:48:32,201 --> 01:48:34,402
age is 28; that is what
we are talking about.

2616
01:48:34,402 --> 01:48:36,300
So here in this case,
you can notice

2617
01:48:36,300 --> 01:48:38,900
that it was representing
like this in this case.

2618
01:48:38,900 --> 01:48:42,200
The output is coming out
in this particular format now,

2619
01:48:42,200 --> 01:48:44,568
let's talk about
how Can add the schema

2620
01:48:44,568 --> 01:48:47,674
how we can read the file
and add a schema now.

2621
01:48:47,674 --> 01:48:50,702
so we will be first
of all importing the types class

2622
01:48:50,702 --> 01:48:51,706
into your Spark shell.

2623
01:48:51,706 --> 01:48:52,588
So this is

2624
01:48:52,588 --> 01:48:54,815
what we have done
by using import statement.

2625
01:48:54,815 --> 01:48:58,286
Then we are going to import
the Row class into the Spark shell.

2626
01:48:58,286 --> 01:49:00,500
So Row will be used
in mapping the RDD schema.

2627
01:49:00,500 --> 01:49:00,813
Right?

2628
01:49:00,813 --> 01:49:01,700
So you can notice

2629
01:49:01,700 --> 01:49:05,100
we're importing this also then
we are creating an rdd called

2630
01:49:05,000 --> 01:49:06,200
employeeRDD.

2631
01:49:06,200 --> 01:49:07,900
So in case this case
you can notice

2632
01:49:07,900 --> 01:49:09,809
that the same RDD
we are creating

2633
01:49:09,809 --> 01:49:12,700
and we are creating this
with the help of this text file.

2634
01:49:12,700 --> 01:49:15,700
So once we have create this we
are going to Define our schema.

2635
01:49:15,700 --> 01:49:17,300
So this is the schema approach.

2636
01:49:17,300 --> 01:49:17,572
Okay.

2637
01:49:17,572 --> 01:49:18,452
So in this case,

2638
01:49:18,452 --> 01:49:21,050
we are going to define it
like name, then a space,

2639
01:49:21,050 --> 01:49:21,800
then age. Okay,

2640
01:49:21,800 --> 01:49:24,700
because these were
the two I have in my data also

2641
01:49:24,700 --> 01:49:26,129
in this employee.txt;

2642
01:49:26,129 --> 01:49:27,305
if you look at these

2643
01:49:27,305 --> 01:49:29,600
are the two fields which
we have: name and age.

2644
01:49:29,600 --> 01:49:31,635
Now what we can do
once we have done

2645
01:49:31,635 --> 01:49:34,100
that then we can split it
with respect to space.

2646
01:49:34,100 --> 01:49:34,600
We can say

2647
01:49:34,600 --> 01:49:37,082
that our mapping value
and we are passing it

2648
01:49:37,082 --> 01:49:39,200
all this value inside
of a StructField.

2649
01:49:39,200 --> 01:49:42,200
Okay, so we are defining our
fields array.

2650
01:49:42,200 --> 01:49:43,500
That is what we are doing.

2651
01:49:43,500 --> 01:49:45,200
See this, the fields array,

2652
01:49:45,200 --> 01:49:49,500
which is going to now output
after mapping the employeeRDD.

2653
01:49:49,500 --> 01:49:51,200
Okay, so that is
what we are doing.

2654
01:49:51,200 --> 01:49:54,413
So we want to just do this
with my schema string,

2655
01:49:54,413 --> 01:49:55,375
then in the end.

2656
01:49:55,375 --> 01:49:57,300
We will be obtaining this field.

2657
01:49:57,300 --> 01:49:59,940
If you notice this field
what we have created here.

2658
01:49:59,940 --> 01:50:01,788
We are obtaining
this into a schema.

2659
01:50:01,788 --> 01:50:03,900
So we are passing this
into a struct type

2660
01:50:03,900 --> 01:50:06,400
and it is getting converted
to be our schema.

2661
01:50:06,500 --> 01:50:08,200
So that is what we will do.

2662
01:50:08,200 --> 01:50:10,768
You can see all
this execution same steps.

2663
01:50:10,768 --> 01:50:13,357
We are just executing
in this terminal now,

2664
01:50:13,357 --> 01:50:16,500
Let's see how we are going
to transform the results.

2665
01:50:16,500 --> 01:50:18,300
Now, whatever we
have done, right?

2666
01:50:18,300 --> 01:50:21,229
So now we have to create
an RDD called rowRDD.

2667
01:50:21,229 --> 01:50:22,000
So let's create

2668
01:50:22,000 --> 01:50:25,088
that rowRDD. We are going
to create it, and we want

2669
01:50:25,088 --> 01:50:28,500
to transform the employeeRDD
using the map function

2670
01:50:28,500 --> 01:50:29,513
into rowRDD.

2671
01:50:29,513 --> 01:50:30,564
So let's do that.

2672
01:50:30,564 --> 01:50:30,837
Okay.

2673
01:50:30,837 --> 01:50:31,717
So in this case

2674
01:50:31,717 --> 01:50:34,483
what we are doing so look
at this employeeRDD:

2675
01:50:34,483 --> 01:50:36,797
we are splitting it
with respect to the comma,

2676
01:50:36,797 --> 01:50:40,000
and after that we are telling
see remember we have name

2677
01:50:40,000 --> 01:50:41,400
and then age, like this, so

2678
01:50:41,400 --> 01:50:43,500
that's what we are telling,
that attributes

2679
01:50:43,500 --> 01:50:44,737
zero and attributes

2680
01:50:44,737 --> 01:50:47,796
one. And why are we trimming
it? Just to ensure

2681
01:50:47,796 --> 01:50:49,900
that there are no spaces
and so on,

2682
01:50:49,900 --> 01:50:52,600
so those things we don't want
to unnecessarily keep up.

2683
01:50:52,600 --> 01:50:55,400
So that's the reason we are
defining this trim statement.

2684
01:50:55,400 --> 01:50:58,300
Now after that after we
once we are done with this,

2685
01:50:58,300 --> 01:51:01,100
we are going to define
a data frame, employeeDF,

2686
01:51:01,100 --> 01:51:03,874
and we are going to store
that rdd schema into it.

2687
01:51:03,874 --> 01:51:05,764
So now if you notice
this rowRDD,

2688
01:51:05,764 --> 01:51:07,300
which we have defined here

2689
01:51:07,300 --> 01:51:11,124
and schema which we have defined
in the last case right now

2690
01:51:11,124 --> 01:51:13,300
if you'll go back
and notice here.

2691
01:51:13,300 --> 01:51:16,300
Schema, we have created here
right with respect to my Fields.

2692
01:51:16,600 --> 01:51:19,100
So that schema and this value

2693
01:51:19,100 --> 01:51:21,900
what we have just
created here, rowRDD,

2694
01:51:21,900 --> 01:51:23,450
We are going to pass it and say

2695
01:51:23,450 --> 01:51:25,200
that we are going
to create a data frame.

2696
01:51:25,200 --> 01:51:27,900
So this will help us
in creating a data frame now,

2697
01:51:27,900 --> 01:51:31,135
we can create our temporary view
on the basis of employeeDF.

2698
01:51:31,135 --> 01:51:33,900
So let's create an employee
temporary view, and then

2699
01:51:33,900 --> 01:51:36,900
what we can do we can execute
any SQL queries on top of it.

2700
01:51:36,900 --> 01:51:38,700
So as you can see
with spark.sql we

2701
01:51:38,700 --> 01:51:42,000
can create all the SQL queries
and can directly execute

2702
01:51:42,000 --> 01:51:43,200
that now what we can do.

2703
01:51:43,300 --> 01:51:45,700
We want to Output the values
we can quickly do that.

2704
01:51:45,800 --> 01:51:46,000
Now.

2705
01:51:46,000 --> 01:51:48,500
We want to let's say display
the names of we can say Okay,

2706
01:51:48,500 --> 01:51:51,600
attribute 0 contains the name
we can use the show command.
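A hedged sketch of the programmatic schema steps walked through above, for a Spark shell. The names employeeRDD, schemaString, fields, rowRDD and employeeDF follow the session; the two sample lines, the app name, and the local master are assumptions, and the inline lines stand in for the employee.txt file.
```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// Assumption: local master and app name, so the sketch is self-contained.
val spark = SparkSession.builder().appName("Programmatic schema sketch").master("local[*]").getOrCreate()
import spark.implicits._ // needed for the String encoder used by map below
// Hypothetical "name, age" lines standing in for employee.txt.
val employeeRDD = spark.sparkContext.parallelize(Seq("John, 28", "Andrew, 36"))
// The schema is described as a space-separated string and turned into fields.
val schemaString = "name age"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Each line becomes a Row; trim removes the space left after the comma.
val rowRDD = employeeRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))
val employeeDF = spark.createDataFrame(rowRDD, schema)
employeeDF.createOrReplaceTempView("employee")
val results = spark.sql("SELECT name FROM employee")
// attributes(0) is the first column of each result row, here the name.
results.map(attributes => "Name: " + attributes(0)).show()
```
Both columns are kept as strings here because the schema string carries no type information.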

2707
01:51:51,600 --> 01:51:54,662
So this is how we
will be performing the operation

2708
01:51:54,662 --> 01:51:56,100
in the schema way. Now,

2709
01:51:56,100 --> 01:51:58,900
so this is the same output way
means we're just executing

2710
01:51:58,900 --> 01:51:59,914
this whole thing up.

2711
01:51:59,914 --> 01:52:01,100
You can notice here.

2712
01:52:01,100 --> 01:52:03,400
Also, we are just
saying attributes(0).

2713
01:52:03,400 --> 01:52:06,205
It is representing
my output. Now,

2714
01:52:06,205 --> 01:52:08,200
let's talk about Json data.

2715
01:52:08,200 --> 01:52:10,085
Now when we talk
about Json data,

2716
01:52:10,085 --> 01:52:13,261
let's talk about how we
can load our files and work on

2717
01:52:13,261 --> 01:52:15,496
them. So in this case,
we will be first,

2718
01:52:15,496 --> 01:52:17,338
Let's say importing
our libraries.

2719
01:52:17,338 --> 01:52:18,800
Once we are done with that.

2720
01:52:18,800 --> 01:52:20,300
Now after that we can just say

2721
01:52:20,300 --> 01:52:23,587
that read.json, we are
just bringing up our

2722
01:52:23,587 --> 01:52:25,611
employee.json. You see,
this is the execution

2723
01:52:25,611 --> 01:52:27,200
of this part now similarly,

2724
01:52:27,200 --> 01:52:29,042
we can also write
back in Parquet,

2725
01:52:29,042 --> 01:52:31,282
or we can also read
the value from Parquet.

2726
01:52:31,282 --> 01:52:32,400
You can notice this

2727
01:52:32,400 --> 01:52:35,600
if you want to write
let's say this value, employeeDF,

2728
01:52:35,600 --> 01:52:37,730
the data frame, in the Parquet way,

2729
01:52:37,730 --> 01:52:40,500
so I can say
write.parquet.

2730
01:52:40,500 --> 01:52:43,143
So this
employee.parquet will

2731
01:52:43,143 --> 01:52:46,504
be created, and here all
the values will be converted

2732
01:52:46,504 --> 01:52:47,900
to employee.parquet.

2733
01:52:47,900 --> 01:52:49,133
Only thing is the data.

2734
01:52:49,133 --> 01:52:51,600
If you go and see
in this particular directory,

2735
01:52:51,600 --> 01:52:52,717
this will be a directory.

2736
01:52:52,717 --> 01:52:53,954
which would be getting created.

2737
01:52:53,954 --> 01:52:55,400
So in this data,
you will notice

2738
01:52:55,400 --> 01:52:57,500
that you will not be able
to read the data.

2739
01:52:57,500 --> 01:53:00,100
So in that case
because it's not human readable.

2740
01:53:00,100 --> 01:53:02,200
So that's the reason you
will not be able to do that.

2741
01:53:02,200 --> 01:53:04,299
So, let's say you want
to read it now so you

2742
01:53:04,299 --> 01:53:05,449
can again bring it back

2743
01:53:05,449 --> 01:53:08,600
by using read.parquet. You are
reading this employee.parquet,

2744
01:53:08,600 --> 01:53:09,600
which I just created

2745
01:53:09,600 --> 01:53:11,700
then you are creating
a temporary view

2746
01:53:11,700 --> 01:53:12,775
or temporary table

2747
01:53:12,775 --> 01:53:15,488
and then By using
standard SQL you can execute

2748
01:53:15,488 --> 01:53:16,903
on your temporary table.

2749
01:53:16,903 --> 01:53:17,844
Now in this way.

2750
01:53:17,844 --> 01:53:21,000
You can read your Parquet file
data, and in the end we are just

2751
01:53:21,000 --> 01:53:24,284
displaying the result. See
the similar output of this.
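The Parquet round trip described above can be sketched like this for a Spark shell. The sample rows, the temporary output directory, the view name parquetEmployee, the app name, and the local master are all assumptions for illustration; the session writes to its own employee.parquet location.
```scala
import org.apache.spark.sql.SparkSession
import java.nio.file.Files
// Assumption: local master and app name, so the sketch is self-contained.
val spark = SparkSession.builder().appName("Parquet sketch").master("local[*]").getOrCreate()
import spark.implicits._
// Hypothetical sample rows standing in for the employee data frame.
val employeeDF = Seq(("John", 28), ("Andrew", 36)).toDF("name", "age")
// Assumption: a fresh temp directory, so the write never collides with existing output.
val outPath = Files.createTempDirectory("spark-parquet-sketch").resolve("employee.parquet").toString
// write.parquet creates a directory of binary part files, not one readable file.
employeeDF.write.parquet(outPath)
// Reading it back restores the schema that was stored with the data.
val parquetDF = spark.read.parquet(outPath)
parquetDF.createOrReplaceTempView("parquetEmployee")
val namesDF = spark.sql("SELECT name FROM parquetEmployee WHERE age BETWEEN 18 AND 30")
namesDF.show()
```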

2752
01:53:24,284 --> 01:53:24,600
Okay.

2753
01:53:24,600 --> 01:53:27,100
This is how we can execute
all these things up now.

2754
01:53:27,100 --> 01:53:28,670
Once we have done all this,

2755
01:53:28,670 --> 01:53:31,200
let's see how we
can create our data frames.

2756
01:53:31,200 --> 01:53:33,100
So let's create this file path.

2757
01:53:33,100 --> 01:53:36,390
So let's say we have created
this file, employee.json;

2758
01:53:36,390 --> 01:53:38,508
after that we can
create a data frame

2759
01:53:38,508 --> 01:53:39,943
from our Json path, right?

2760
01:53:39,943 --> 01:53:42,884
So we are creating this
by using read.json. Then

2761
01:53:42,884 --> 01:53:44,420
we can Print the schema.

2762
01:53:44,420 --> 01:53:47,300
What this does is, it is going
to print the schema

2763
01:53:47,300 --> 01:53:49,300
of my employee data frame?

2764
01:53:49,300 --> 01:53:52,500
Okay, so we are going to use
this printSchema to print

2765
01:53:52,500 --> 01:53:55,795
up all the values then we
can create a temporary view

2766
01:53:55,795 --> 01:53:57,000
of this data frame.

2767
01:53:57,000 --> 01:53:58,100
So we are doing

2768
01:53:58,100 --> 01:54:00,618
that. See, createOrReplaceTempView;
we are creating

2769
01:54:00,618 --> 01:54:02,860
that which we have seen
it last time also now

2770
01:54:02,860 --> 01:54:04,888
after that we can
execute our SQL query.

2771
01:54:04,888 --> 01:54:07,800
So let's say we are executing
our SQL query from employee

2772
01:54:07,800 --> 01:54:10,000
where age is between 18
and 30, right?

2773
01:54:10,000 --> 01:54:11,300
So this kind of SQL query.

2774
01:54:11,300 --> 01:54:12,854
Let's say we want
to do we can get

2775
01:54:12,854 --> 01:54:14,989
that And in the end we
can see the output Also.

2776
01:54:14,989 --> 01:54:16,278
Let's see this execution.

2777
01:54:16,278 --> 01:54:17,000
So you can see

2778
01:54:17,000 --> 01:54:20,891
that all the employees who
are let's say between 18 and 30

2779
01:54:20,891 --> 01:54:22,900
that is showing up
in the output.

2780
01:54:22,900 --> 01:54:23,147
Now.

2781
01:54:23,147 --> 01:54:25,176
Let's see this
rdd operation way.

2782
01:54:25,176 --> 01:54:26,369
Now what you can do

2783
01:54:26,369 --> 01:54:30,200
so we are going to create this
RDD, otherEmployeeRDD, now,

2784
01:54:30,200 --> 01:54:33,900
which is going to store
the content of employee George

2785
01:54:33,900 --> 01:54:35,300
from New Delhi, Delhi.

2786
01:54:35,300 --> 01:54:36,433
So see this part,

2787
01:54:36,433 --> 01:54:39,500
so here we are creating this
by using makeRDD,

2788
01:54:39,500 --> 01:54:43,400
and this is just going
to store the content containing

2789
01:54:43,400 --> 01:54:45,000
George from New Delhi, right?

2790
01:54:45,000 --> 01:54:45,900
You can see this

2791
01:54:45,900 --> 01:54:48,300
so New Delhi is my city
name, and the state is Delhi.

2792
01:54:48,300 --> 01:54:50,250
So that is what we
are passing inside it.

2793
01:54:50,250 --> 01:54:52,900
Now what we are doing we
are assigning the content

2794
01:54:52,900 --> 01:54:56,700
of this other employee RDD
into my other employee.

2795
01:54:56,700 --> 01:54:59,200
So we are using
this spark.read.json

2796
01:54:59,200 --> 01:55:00,600
and we are reading the value,

2797
01:55:00,600 --> 01:55:02,800
and in the end we
are using this show API.

2798
01:55:02,800 --> 01:55:04,857
You can notice
this output coming up now.

2799
01:55:04,857 --> 01:55:06,400
Let's see with the Hive table.

2800
01:55:06,400 --> 01:55:08,536
So with the Hive table,
if you want to read that,

2801
01:55:08,536 --> 01:55:10,186
so let's do it
with the case class

2802
01:55:10,186 --> 01:55:11,136
and SparkSession.

2803
01:55:11,136 --> 01:55:11,900
So first of all,

2804
01:55:11,900 --> 01:55:14,713
we are going to import
the Row class, and we are going

2805
01:55:14,713 --> 01:55:16,700
to use SparkSession
in Spark.

2806
01:55:16,700 --> 01:55:18,000
So let's do that right away.

2807
01:55:18,000 --> 01:55:20,082
I'm importing this Row,
this SparkSession,

2808
01:55:20,082 --> 01:55:21,200
and now, after that,

2809
01:55:21,200 --> 01:55:24,186
we are going to create a case class
Record containing this key,

2810
01:55:24,186 --> 01:55:25,756
which is of integer data type

2811
01:55:25,756 --> 01:55:27,576
and a value which is
of string type.

2812
01:55:27,576 --> 01:55:29,426
Then we are going
to set our location

2813
01:55:29,426 --> 01:55:30,726
of the warehouse location.

2814
01:55:30,726 --> 01:55:31,948
Okay, to this spark-warehouse.

2815
01:55:31,948 --> 01:55:33,400
So that is what we are doing.

2816
01:55:33,400 --> 01:55:33,629
Now.

2817
01:55:33,629 --> 01:55:36,100
We are going to build
a SparkSession, spark,

2818
01:55:36,100 --> 01:55:39,200
to demonstrate the Hive
example in Spark SQL.

2819
01:55:39,200 --> 01:55:40,100
Look at this now,

2820
01:55:40,100 --> 01:55:42,700
so we are using
SparkSession.builder again.

2821
01:55:42,700 --> 01:55:44,331
We are passing the app name

2822
01:55:44,331 --> 01:55:46,700
to it, we are passing
the configuration to it.

2823
01:55:46,700 --> 01:55:48,968
And then we are saying
that we want to enable

2824
01:55:48,968 --> 01:55:50,000
the Hive support. Now,

2825
01:55:50,000 --> 01:55:50,800
once we have done

2826
01:55:50,800 --> 01:55:53,800
that, we are importing
this Spark SQL library.

2827
01:55:54,000 --> 01:55:56,612
And then you can notice
that we can use SQL

2828
01:55:56,612 --> 01:55:58,601
so we can now create a table src.

2829
01:55:58,601 --> 01:56:01,336
So you can see CREATE TABLE
IF NOT EXISTS src

2830
01:56:01,336 --> 01:56:04,800
with columns to store the data
as a key-value pair.

2831
01:56:04,800 --> 01:56:06,399
So that is what we
are doing here.

2832
01:56:06,400 --> 01:56:09,000
Now, you can see all
this execution of the same step.

2833
01:56:09,000 --> 01:56:09,209
Now.

2834
01:56:09,209 --> 01:56:12,430
Let's see the SQL operation
happening here. Now, in this case,

2835
01:56:12,430 --> 01:56:13,229
what we can do.

2836
01:56:13,229 --> 01:56:15,700
We can now load the data
from this example,

2837
01:56:15,700 --> 01:56:17,500
which is present for us.

2838
01:56:17,500 --> 01:56:19,400
It is this kv1.txt file,

2839
01:56:19,400 --> 01:56:20,869
which is available to us

2840
01:56:20,869 --> 01:56:23,281
and we want to store it
into the table src,

2841
01:56:23,281 --> 01:56:25,225
which we have just
created and now

2842
01:56:25,225 --> 01:56:28,872
if you want to just view all
this output, it becomes the SQL

2843
01:56:28,872 --> 01:56:30,305
SELECT * FROM src,

2844
01:56:30,305 --> 01:56:31,764
and it is going to show up

2845
01:56:31,764 --> 01:56:34,005
all the values you
can see this output.

2846
01:56:34,005 --> 01:56:34,300
Okay.

2847
01:56:34,300 --> 01:56:37,341
So this is the way you can show
the values. Now, similarly, we

2848
01:56:37,341 --> 01:56:38,899
can perform the count operation.

2849
01:56:38,899 --> 01:56:40,993
Okay, so we can say
SELECT COUNT(*)

2850
01:56:40,993 --> 01:56:43,400
FROM src to select the number
of keys in the

2851
01:56:43,400 --> 01:56:45,858
src table, and now
select all the records,

2852
01:56:45,858 --> 01:56:48,800
right? So we can say
SELECT key, value,

2853
01:56:48,800 --> 01:56:49,500
so you can see

2854
01:56:49,500 --> 01:56:52,150
that we can perform all
our Hive operations here

2855
01:56:52,150 --> 01:56:53,562
on this, right? Similarly,

2856
01:56:53,562 --> 01:56:56,300
we can create a Dataset,
stringsDS, from the Spark DataFrame.

2857
01:56:56,300 --> 01:56:58,623
So you can see this
also: by using sqlDF,

2858
01:56:58,623 --> 01:57:00,835
which we already have,
we can just say map

2859
01:57:00,835 --> 01:57:01,730
and then provide

2860
01:57:01,730 --> 01:57:04,541
the case class, and can map
the key-value pair,

2861
01:57:04,541 --> 01:57:07,600
and then in the end we
can show all these values. See

2862
01:57:07,600 --> 01:57:10,644
the execution of this, and then
you can notice this output,

2863
01:57:10,644 --> 01:57:11,828
which we wanted. Now,

2864
01:57:11,828 --> 01:57:13,288
Let's see the result back.

2865
01:57:13,288 --> 01:57:15,700
But now we can create
our data frame here.

2866
01:57:15,700 --> 01:57:18,384
Right, so we can create
our DataFrame recordsDF

2867
01:57:18,384 --> 01:57:19,848
and store all the results

2868
01:57:19,848 --> 01:57:21,900
which contains the values
from 1 to 100.

2869
01:57:21,900 --> 01:57:24,600
So we are storing all the values
between 1 and 100.

2870
01:57:24,600 --> 01:57:26,700
Then we are creating
a temporary view.

2871
01:57:26,700 --> 01:57:28,900
Okay for the records,
that's what we are doing.

2872
01:57:28,900 --> 01:57:31,200
So for recordsDF we are
creating a temporary view

2873
01:57:31,200 --> 01:57:33,800
so that we can run
all our SQL queries. Now,

2874
01:57:33,800 --> 01:57:35,336
we can execute all the values

2875
01:57:35,336 --> 01:57:38,400
so you can also notice we
are doing join operation here.

2876
01:57:38,400 --> 01:57:40,900
Okay, so we can display
the content of join

2877
01:57:40,900 --> 01:57:43,300
between the records
and this src table.

2878
01:57:43,600 --> 01:57:46,400
We can do a join on this part,
so we can also perform all

2879
01:57:46,400 --> 01:57:48,300
the join operations
and get the output.

2880
01:57:48,300 --> 01:57:48,500
Now.

2881
01:57:48,500 --> 01:57:50,356
Let's see our use case for it.

2882
01:57:50,356 --> 01:57:51,908
If we talk about use case.

2883
01:57:51,908 --> 01:57:55,071
We are going to analyze
our stock market with the help

2884
01:57:55,071 --> 01:57:57,100
of Spark SQL. So
let's understand

2885
01:57:57,100 --> 01:57:58,500
the problem statement first.

2886
01:57:58,500 --> 01:58:00,382
So now in our problem statement,

2887
01:58:00,382 --> 01:58:04,029
so what do we want to do?
Well, definitely everybody

2888
01:58:04,029 --> 01:58:07,156
must be aware of the stock market.
In the stock market,

2889
01:58:07,156 --> 01:58:08,811
a lot
of activities happen.

2890
01:58:08,811 --> 01:58:10,400
You want to now analyze it

2891
01:58:10,400 --> 01:58:13,300
in order to make some profit
out of it and all those stuff.

2892
01:58:13,300 --> 01:58:15,200
Alright, so now
let's say our company

2893
01:58:15,200 --> 01:58:18,200
has collected a lot of data
for 10 different companies

2894
01:58:18,200 --> 01:58:20,000
and they want to do
some computation.

2895
01:58:20,000 --> 01:58:22,964
Let's say they want to compute
the average closing price.

2896
01:58:22,964 --> 01:58:26,300
They want to list the companies
with the highest closing prices.

2897
01:58:26,300 --> 01:58:29,749
They want to compute the average
closing price per month.

2898
01:58:29,749 --> 01:58:32,485
They want to list the number
of big price rises

2899
01:58:32,485 --> 01:58:35,400
and falls, and compute
some statistical correlation.

2900
01:58:35,400 --> 01:58:37,700
So these things we are going
to do with the help

2901
01:58:37,700 --> 01:58:39,158
of our Spark SQL statements.

2902
01:58:39,158 --> 01:58:42,255
So this is very common: we want
to process huge data.

2903
01:58:42,255 --> 01:58:45,103
We want to handle the input
from multiple sources,

2904
01:58:45,103 --> 01:58:47,200
we want to process
the data in real time

2905
01:58:47,200 --> 01:58:48,754
and it should be easy to use.

2906
01:58:48,754 --> 01:58:50,488
It should not be
very complicated.

2907
01:58:50,488 --> 01:58:53,800
So all these requirements will be
handled by Spark SQL, right?

2908
01:58:53,800 --> 01:58:55,700
So that's the reason
we are going to use

2909
01:58:55,700 --> 01:58:56,950
Spark SQL.

2910
01:58:56,950 --> 01:58:57,700
So as I said

2911
01:58:57,700 --> 01:58:59,600
that we are going
to use 10 companies.

2912
01:58:59,600 --> 01:59:02,076
So we are going to kind
of use these 10 companies,

2913
01:59:02,076 --> 01:59:03,498
and on those ten companies.

2914
01:59:03,498 --> 01:59:04,500
We are going to see

2915
01:59:04,500 --> 01:59:07,200
that we are going to perform
our analysis on top of it.

2916
01:59:07,200 --> 01:59:09,100
So we will be using
this table data

2917
01:59:09,100 --> 01:59:11,800
from Yahoo finance
for all this following stocks.

2918
01:59:11,800 --> 01:59:14,300
So for AAON, ABAX, etc.

2919
01:59:14,300 --> 01:59:15,400
So all these companies

2920
01:59:15,400 --> 01:59:17,600
we have, on which we
are going to perform the analysis.

2921
01:59:17,600 --> 01:59:20,800
So this is what my data will look
like, which will be having date,

2922
01:59:20,800 --> 01:59:25,046
opening, high rate, low rate,
closing, volume, adjusted close.

2923
01:59:25,046 --> 01:59:27,700
All this data will
be presented now.

2924
01:59:27,700 --> 01:59:28,917
So, let's see how we

2925
01:59:28,917 --> 01:59:31,900
can implement a stock analysis
using Spark SQL.

2926
01:59:31,900 --> 01:59:33,497
So what we have to do for that,

2927
01:59:33,497 --> 01:59:36,278
so this is what my data
flow diagram will look like.

2928
01:59:36,278 --> 01:59:38,811
So we are going to initially
have the huge amount

2929
01:59:38,811 --> 01:59:40,000
of real-time stock data

2930
01:59:40,000 --> 01:59:42,400
that we are going to process
through this Spark SQL.

2931
01:59:42,400 --> 01:59:44,600
So we are going to convert it into
a named-column base.

2932
01:59:44,600 --> 01:59:46,308
Then we are going
to create an RDD

2933
01:59:46,308 --> 01:59:47,658
for functional programming.

2934
01:59:47,658 --> 01:59:48,395
So let's do that.

2935
01:59:48,395 --> 01:59:50,354
Then we are going to use
our Spark SQL,

2936
01:59:50,354 --> 01:59:52,500
which will calculate
the average closing price

2937
01:59:52,500 --> 01:59:53,600
per year, calculating

2938
01:59:53,600 --> 01:59:56,188
the company with the highest closing
per year. Then by

2939
01:59:56,188 --> 01:59:59,000
some Spark SQL queries
we will be getting our outputs.

2940
01:59:59,000 --> 02:00:01,000
Okay, so that is
what we're going to do.

2941
02:00:01,000 --> 02:00:03,400
So all the queries
what we are getting generated,

2942
02:00:03,400 --> 02:00:05,500
so it's not only this we
are also going to compute

2943
02:00:05,500 --> 02:00:08,000
few other queries what we
have solve those queries.

2944
02:00:08,000 --> 02:00:09,200
We're going to execute them.

2945
02:00:09,200 --> 02:00:09,500
Now.

2946
02:00:09,500 --> 02:00:11,273
This is how the flow
will look like.

2947
02:00:11,273 --> 02:00:13,200
So we are going
to initially have this data,

2948
02:00:13,200 --> 02:00:16,000
which I have just shown you. Now,
what are you going to do?

2949
02:00:16,000 --> 02:00:17,700
You're going to create
a data frame you

2950
02:00:17,700 --> 02:00:19,990
are going to then create
a joined close RDD.

2951
02:00:19,990 --> 02:00:21,850
We will see what we
are going to do here.

2952
02:00:21,850 --> 02:00:23,900
Then we are going
to calculate the average

2953
02:00:23,900 --> 02:00:25,160
closing price per year.

2954
02:00:25,160 --> 02:00:27,900
We are going to hit
a Spark SQL query and get

2955
02:00:27,900 --> 02:00:29,314
the result in the table.

2956
02:00:29,314 --> 02:00:31,800
So this is how my execution
will look like.
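
The "average closing price per year" step in this flow boils down to a GROUP BY on the year. This is a minimal sketch, assuming Python's built-in sqlite3 as a stand-in for Spark SQL; the table name stocks, the column names dt and close, and the sample values are all hypothetical.

```python
import sqlite3

# In-memory table standing in for the stock DataFrame (an assumption, not Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (dt TEXT, close REAL)")

# Hypothetical closing prices; real rows would come from the downloaded CSV.
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?)",
    [("2015-03-02", 10.0), ("2015-09-01", 20.0),
     ("2016-02-01", 30.0), ("2016-08-01", 50.0)],
)

# Take the year from the ISO date string and average the closing price per year.
avg_by_year = conn.execute(
    "SELECT substr(dt, 1, 4) AS year, AVG(close) AS avg_close "
    "FROM stocks GROUP BY year ORDER BY year"
).fetchall()
print(avg_by_year)
```

The same SELECT, with a year function in place of substr, is the kind of statement that would be passed to Spark SQL once the DataFrame is registered as a temporary view.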

2957
02:00:31,800 --> 02:00:33,445
So what we are going
to do in this case,

2958
02:00:33,445 --> 02:00:34,095
first of all,

2959
02:00:34,095 --> 02:00:36,839
we are going to initialize
Spark SQL in this function.

2960
02:00:36,839 --> 02:00:39,600
We are going to import all
the required libraries then we

2961
02:00:39,600 --> 02:00:40,500
are going to start

2962
02:00:40,500 --> 02:00:43,216
our Spark session. After
importing all the required

2963
02:00:43,216 --> 02:00:44,473
libraries, we are going to create

2964
02:00:44,473 --> 02:00:47,251
our case class whatever
is required in the case class,

2965
02:00:47,251 --> 02:00:49,466
you can notice. Then
we are going to define

2966
02:00:49,466 --> 02:00:50,600
our parse stock schema.

2967
02:00:50,600 --> 02:00:53,350
So because we have already
learnt how to create a schema

2968
02:00:53,350 --> 02:00:55,500
so we're going to create
this stock table schema

2969
02:00:55,500 --> 02:00:56,800
by creating this way.

2970
02:00:56,800 --> 02:00:59,200
Well, then we are going
to define our parse

2971
02:00:59,200 --> 02:01:00,900
RDD. So in parse RDD,

2972
02:01:00,900 --> 02:01:02,895
if you notice,
here we are creating

2973
02:01:02,895 --> 02:01:04,289
this parse RDD.

2974
02:01:04,289 --> 02:01:05,708
We are going to create all

2975
02:01:05,708 --> 02:01:07,600
of that by using
this RDD. First,

2976
02:01:07,600 --> 02:01:10,300
we are going to remove
the header also from it.

2977
02:01:10,300 --> 02:01:12,749
Then we are going
to read our CSV file

2978
02:01:12,749 --> 02:01:15,200
into the stocksAAONDF
DataFrame.

2979
02:01:15,200 --> 02:01:17,500
So we are going to read
this AAON.csv file.

2980
02:01:17,500 --> 02:01:20,161
You can see we are reading
this file and we are going

2981
02:01:20,161 --> 02:01:21,800
to convert it into a data frame.

2982
02:01:21,800 --> 02:01:23,450
So we are passing
it as an RDD.

2983
02:01:23,450 --> 02:01:24,511
Once we are done then

2984
02:01:24,511 --> 02:01:26,697
if you want to print
the output we can do it

2985
02:01:26,697 --> 02:01:27,997
with the help of show API.

2986
02:01:27,997 --> 02:01:29,852
Once we are done
with this now we want

2987
02:01:29,852 --> 02:01:31,450
to, let's say, display the average

2988
02:01:31,450 --> 02:01:34,100
of the adjusted closing price
for AAON for every month.

2989
02:01:34,100 --> 02:01:37,629
So we can do all of that also
by using a select query, right?

2990
02:01:37,629 --> 02:01:40,300
so we can say this data frame
dot select and pass

2991
02:01:40,300 --> 02:01:43,100
whatever parameters are required
to get the average. Now,

2992
02:01:43,100 --> 02:01:44,000
you can notice that

2993
02:01:44,000 --> 02:01:47,200
inside this we are creating
the alias of the things as well.

2994
02:01:47,200 --> 02:01:48,300
So for this DT,

2995
02:01:48,300 --> 02:01:50,059
we are creating
an alias here, right?

2996
02:01:50,059 --> 02:01:52,538
So we are creating the alias
for it, and an order by,

2997
02:01:52,538 --> 02:01:54,714
and we are showing
the output also so here

2998
02:01:54,714 --> 02:01:56,307
what we are going to do now,

2999
02:01:56,307 --> 02:01:57,400
we will be checking

3000
02:01:57,400 --> 02:01:59,669
that the closing
price for Microsoft.

3001
02:01:59,669 --> 02:02:03,300
So let's say it is going up
by 2, or by greater than 2,

3002
02:02:03,300 --> 02:02:05,900
or wherever it is going
by greater than 2 and now we

3003
02:02:05,900 --> 02:02:08,039
want to get the output
and display the result

3004
02:02:08,039 --> 02:02:10,023
so you can notice
that wherever it is going

3005
02:02:10,023 --> 02:02:12,282
to be greater than 2 we
are getting the value.

3006
02:02:12,282 --> 02:02:14,383
So we are hitting
the SQL query to do that.

3007
02:02:14,383 --> 02:02:16,483
So we are hitting
the SQL query now on this

3008
02:02:16,483 --> 02:02:17,935
you can notice the SQL query

3009
02:02:17,935 --> 02:02:19,975
which we are hitting
on the stocks.

3010
02:02:19,975 --> 02:02:20,775
MSFT.

3011
02:02:20,775 --> 02:02:21,128
Right?

3012
02:02:21,128 --> 02:02:22,768
This is the DataFrame

3013
02:02:22,768 --> 02:02:24,900
we have created now
on this we are doing

3014
02:02:24,900 --> 02:02:27,076
that and we are putting
our query that

3015
02:02:27,076 --> 02:02:29,395
where my condition
is to be true, meaning

3016
02:02:29,395 --> 02:02:32,066
where my closing price
and my opening price

3017
02:02:32,066 --> 02:02:34,300
because let's say
at the closing price

3018
02:02:34,300 --> 02:02:36,852
the stock price was, let's say,
a hundred US dollars,

3019
02:02:36,852 --> 02:02:38,500
and at that time in the morning

3020
02:02:38,500 --> 02:02:40,800
when it opened it was,
let's say, 98 USD or so,

3021
02:02:40,800 --> 02:02:43,131
wherever it is going
to be having a difference

3022
02:02:43,131 --> 02:02:43,961
of 2 or greater

3023
02:02:43,961 --> 02:02:46,300
than 2, only that output
we want to get. So that is

3024
02:02:46,300 --> 02:02:47,400
what we're doing here.
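
The filter just described (keep the days where the close is more than 2 above the open) can be sketched in a few lines. This is a minimal illustration in plain Python rather than the Spark query from the session; the Stock record, its field names, and the sample prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stock:
    """One trading day for one ticker (field names are assumptions)."""
    date: str
    open: float
    close: float


def big_rises(stocks, threshold=2.0):
    """Return the days where close exceeds open by more than `threshold`."""
    return [s for s in stocks if s.close - s.open > threshold]


# Hypothetical sample days: only the first one rises by more than 2.
sample = [
    Stock("2016-01-04", 98.0, 100.5),
    Stock("2016-01-05", 100.0, 101.0),
    Stock("2016-01-06", 101.0, 99.0),
]
rise_dates = [s.date for s in big_rises(sample)]
print(rise_dates)
```

In SQL terms this corresponds to a WHERE clause on close minus open; taking the absolute value of that difference would capture big falls as well.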

3025
02:02:47,400 --> 02:02:47,600
Now.

3026
02:02:47,600 --> 02:02:50,600
Once we are done then after that
what we are going to do now,

3027
02:02:50,600 --> 02:02:52,628
we are going to use
the join operation.

3028
02:02:52,629 --> 02:02:55,500
So what we are going to do:
we will be joining the AAON

3029
02:02:55,500 --> 02:02:58,300
and ABAX stocks in order
to compare the closing price,

3030
02:02:58,300 --> 02:03:00,200
because we want
to compare the prices

3031
02:03:00,200 --> 02:03:01,297
so we will be doing that.

3032
02:03:01,297 --> 02:03:02,000
So first of all,

3033
02:03:02,000 --> 02:03:04,600
we are going to create a union
of all these stocks

3034
02:03:04,600 --> 02:03:06,500
and then display
the joined rows.

3035
02:03:06,500 --> 02:03:07,259
So look at this

3036
02:03:07,259 --> 02:03:09,284
what we're going to do
we're going to use

3037
02:03:09,284 --> 02:03:10,200
Spark SQL, and

3038
02:03:10,200 --> 02:03:13,000
if you notice this closely
what we're doing in this case,

3039
02:03:13,000 --> 02:03:14,439
So now in this Spark SQL,

3040
02:03:14,439 --> 02:03:16,200
we are hitting
the SQL query

3041
02:03:16,200 --> 02:03:18,780
and all those stuff then
we are saying from this

3042
02:03:18,780 --> 02:03:21,192
and here we are using
this join operation.

3043
02:03:21,192 --> 02:03:22,704
You may see this join operation.

3044
02:03:22,704 --> 02:03:24,500
So this we are joining it on

3045
02:03:24,500 --> 02:03:26,500
and then in the end
we are outputting it.

3046
02:03:26,500 --> 02:03:28,700
So here you can see
you can do a comparison

3047
02:03:28,700 --> 02:03:31,300
of all these close prices
for all these stocks.

3048
02:03:31,300 --> 02:03:34,000
You can also include, you know,
more companies. Right now,

3049
02:03:34,000 --> 02:03:36,280
we have just shown you
an example with two companies,

3050
02:03:36,280 --> 02:03:38,480
but you can do it
for more companies as well.

3051
02:03:38,480 --> 02:03:39,188
Now in this case

3052
02:03:39,188 --> 02:03:41,800
if you notice what we're doing,
we're writing this in the Parquet

3053
02:03:41,800 --> 02:03:44,928
file format and saving it
into this particular location.

3054
02:03:44,928 --> 02:03:47,135
So we are creating
this joint stock market.

3055
02:03:47,135 --> 02:03:49,869
So we are storing it in
the Parquet file format, and here,

3056
02:03:49,869 --> 02:03:51,705
if you want to read
it we can read

3057
02:03:51,705 --> 02:03:52,800
that and show the output,

3058
02:03:52,800 --> 02:03:55,300
but whatever file you
have saved as a Parquet

3059
02:03:55,300 --> 02:03:57,900
file, definitely you
will not be able to read that up

3060
02:03:57,900 --> 02:04:00,700
because that file is going
to be in the Parquet format,

3061
02:04:00,800 --> 02:04:03,900
and Parquet files are files
which you cannot read as plain text.

3062
02:04:03,900 --> 02:04:05,900
You will not be able
to open and read them directly. Now,

3063
02:04:05,900 --> 02:04:08,382
so you will be seeing this
average closing price per year.

3064
02:04:08,382 --> 02:04:10,631
I'm going to show you all
these things running also. I'm

3065
02:04:10,631 --> 02:04:13,181
just trying to explain to you
how things will run,

3066
02:04:13,181 --> 02:04:13,900
what we're doing up here.

3067
02:04:13,900 --> 02:04:15,900
So I will be showing
you all these things

3068
02:04:15,900 --> 02:04:17,100
in execution as well.

3069
02:04:17,200 --> 02:04:18,200
Now in this case,

3070
02:04:18,200 --> 02:04:20,100
if you notice
what we are doing again,

3071
02:04:20,100 --> 02:04:21,907
we are creating
our data frame here.

3072
02:04:21,907 --> 02:04:24,800
Again, we are executing our
query whatever table we have.

3073
02:04:24,800 --> 02:04:26,300
We are executing on top of it.

3074
02:04:26,300 --> 02:04:27,050
So in this case

3075
02:04:27,050 --> 02:04:29,650
because we want to find
the average closing per year.

3076
02:04:29,650 --> 02:04:31,300
So what we are doing
in this case,

3077
02:04:31,300 --> 02:04:33,800
we are going to create
a new table containing

3078
02:04:33,800 --> 02:04:37,700
the average closing price
of, let's say, AAON, ABAX and FAST,

3079
02:04:37,700 --> 02:04:40,319
and then we are going
to display all this new table.

3080
02:04:40,319 --> 02:04:41,369
So we are in the end.

3081
02:04:41,369 --> 02:04:42,800
We are going to
register this table

3082
02:04:42,800 --> 02:04:43,900
as a temporary table

3083
02:04:43,900 --> 02:04:46,515
so that we can execute
our SQL queries on top of it.

3084
02:04:46,515 --> 02:04:47,328
So in this case,

3085
02:04:47,328 --> 02:04:49,828
you can notice that we
are creating this new table.

3086
02:04:49,828 --> 02:04:50,900
And in this new table,

3087
02:04:50,900 --> 02:04:52,900
we are putting
our SQL query, right?

3088
02:04:52,900 --> 02:04:53,711
that SQL query

3089
02:04:53,711 --> 02:04:56,300
is going to contain
the average closing price. So

3090
02:04:56,300 --> 02:05:00,100
the SQL query is finding out
the average closing price of AAON

3091
02:05:00,100 --> 02:05:03,100
and all these companies
then whatever we have now.

3092
02:05:03,100 --> 02:05:05,688
We are going to apply
the transformation step

3093
02:05:05,688 --> 02:05:07,488
now, a transformation
of this new table,

3094
02:05:07,488 --> 02:05:09,188
which we have created
with the year

3095
02:05:09,188 --> 02:05:11,100
and the corresponding
three company data

3096
02:05:11,100 --> 02:05:13,400
which we have created,
into the company

3097
02:05:13,400 --> 02:05:15,103
all table. So
you can notice

3098
02:05:15,103 --> 02:05:17,100
that we are creating
this company all table,

3099
02:05:17,100 --> 02:05:18,247
and here first of all,

3100
02:05:18,247 --> 02:05:20,725
we are going to create
a transformed table, company

3101
02:05:20,725 --> 02:05:23,413
all, and are going to display
the output. So you can notice

3102
02:05:23,413 --> 02:05:25,100
that we are hitting
the SQL query

3103
02:05:25,100 --> 02:05:27,900
and in the end we are printing
this output. Similarly,

3104
02:05:27,900 --> 02:05:29,975
if we want to let's say
compute the best

3105
02:05:29,975 --> 02:05:31,597
of average close we can do that.

3106
02:05:31,597 --> 02:05:33,618
So in this case again
the same way now,

3107
02:05:33,618 --> 02:05:35,800
once you have learned
the basic stuff,

3108
02:05:35,800 --> 02:05:37,426
you can notice that everything

3109
02:05:37,426 --> 02:05:40,400
is following a similar approach
now in this case also,

3110
02:05:40,400 --> 02:05:43,200
we want to find out let's say
the best of the average

3111
02:05:43,200 --> 02:05:46,100
So we are creating
this best company here now.

3112
02:05:46,100 --> 02:05:49,500
It should contain the best
average closing price of AAON, ABAX

3113
02:05:49,500 --> 02:05:52,700
and FAST. So we can just get
this greatest and all that.

3114
02:05:52,700 --> 02:05:53,400
So we are creating

3115
02:05:53,400 --> 02:05:56,675
that then after that we
are going to display this output

3116
02:05:56,675 --> 02:05:59,846
and we will be again registering
it as a temporary table now,

3117
02:05:59,846 --> 02:06:02,700
once we have done that then
we can hit our queries now,

3118
02:06:02,700 --> 02:06:04,350
so we want to check
let's say best

3119
02:06:04,350 --> 02:06:05,600
performing company per year.

3120
02:06:05,600 --> 02:06:07,200
Now what we have to do for that.

3121
02:06:07,200 --> 02:06:09,319
So we are creating
the final table in which

3122
02:06:09,319 --> 02:06:10,400
we are going to compute

3123
02:06:10,400 --> 02:06:13,200
all the things. We are going
to perform the join and so on.

3124
02:06:13,200 --> 02:06:16,082
So all the SQL queries we
are going to perform here

3125
02:06:16,082 --> 02:06:17,200
in order to compute

3126
02:06:17,200 --> 02:06:19,500
that which company
is doing the best

3127
02:06:19,500 --> 02:06:21,250
and then we are going
to display the output.

3128
02:06:21,250 --> 02:06:23,800
So this is what the output
is showing up here.

3129
02:06:23,800 --> 02:06:25,850
We are again storing
it as a temporary view,

3130
02:06:25,850 --> 02:06:28,000
and here again the same
story of correlation

3131
02:06:28,000 --> 02:06:29,400
what we're going to do here.

3132
02:06:29,400 --> 02:06:32,843
So now we will be using
our statistics libraries to find

3133
02:06:32,843 --> 02:06:36,400
the correlation between the AAON and
ABAX companies' closing prices.

3134
02:06:36,400 --> 02:06:38,300
So that is what we
are going to do now.

3135
02:06:38,300 --> 02:06:41,088
So correlation, in the finance
and investment

3136
02:06:41,088 --> 02:06:43,079
industries, is a statistic that

3137
02:06:43,079 --> 02:06:44,300
measures the degree

3138
02:06:44,300 --> 02:06:47,564
to which two securities move
in relation to each other.

3139
02:06:47,564 --> 02:06:49,625
So the closer the correlation is

3140
02:06:49,625 --> 02:06:52,200
to 1, the more closely
the two move together.

3141
02:06:52,200 --> 02:06:53,722
So it is always like

3142
02:06:53,722 --> 02:06:57,300
how two variables are correlated
with each other.

3143
02:06:57,300 --> 02:07:01,400
Let's say your age is highly
correlated to your salary,

3144
02:07:01,400 --> 02:07:05,000
because your earning, like,
when you are young you usually

3145
02:07:05,000 --> 02:07:06,400
earn less, and when you

3146
02:07:06,400 --> 02:07:09,500
are more aged, definitely
you will be earning more

3147
02:07:09,500 --> 02:07:12,811
because you will be more mature.
In a similar way, I can say that

3148
02:07:12,811 --> 02:07:16,400
your salary is also dependent
on your educational qualification,

3149
02:07:16,400 --> 02:07:18,815
and also on the premier
institute from where you

3150
02:07:18,815 --> 02:07:20,149
have done your education.

3151
02:07:20,149 --> 02:07:21,751
Let's say if you are from IIT,

3152
02:07:21,751 --> 02:07:24,100
or IIM, definitely
your salary will be higher

3153
02:07:24,100 --> 02:07:25,300
than from any other campuses.

3154
02:07:25,300 --> 02:07:26,100
Right? I mean,

3155
02:07:26,100 --> 02:07:27,072
it's a probability,

3156
02:07:27,072 --> 02:07:28,300
what I'm telling you.

3157
02:07:28,300 --> 02:07:28,900
So let's say

3158
02:07:28,900 --> 02:07:32,132
if I have to correlate now
in this case the education

3159
02:07:32,132 --> 02:07:35,600
and the salary, I can easily
create a correlation, right?

3160
02:07:35,600 --> 02:07:37,300
So that is
what correlation is.

3161
02:07:37,300 --> 02:07:38,589
So we are going to do all

3162
02:07:38,589 --> 02:07:40,573
that with respect
to our stock analysis.

3163
02:07:40,573 --> 02:07:41,869
Now, what we are doing

3164
02:07:41,869 --> 02:07:45,185
in this case, so you can notice
we are creating this series one

3165
02:07:45,185 --> 02:07:47,188
where we are hitting
the select query. Now,

3166
02:07:47,188 --> 02:07:49,401
we are mapping all
this AAON close price.

3167
02:07:49,401 --> 02:07:52,400
We are converting it to an RDD.
In a similar way, for series 2

3168
02:07:52,400 --> 02:07:53,691
Also we are doing that right.

3169
02:07:53,691 --> 02:07:55,832
So this we are doing
for ABAX, whereas earlier

3170
02:07:55,832 --> 02:07:58,600
we have done it for AAON close,
and then in the end we

3171
02:07:58,600 --> 02:08:00,911
are using Statistics.corr
to create

3172
02:08:00,911 --> 02:08:02,500
a correlation between them.
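
To make the correlation step concrete, here is a minimal sketch of the Pearson coefficient, which is the default method of Spark's Statistics.corr, written in plain Python. The function name pearson and the sample price lists are assumptions for illustration, not taken from the session.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and each series' root sum of squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)


# Hypothetical closing prices for two tickers over the same four days.
series1 = [10.0, 11.0, 12.5, 13.0]
series2 = [20.0, 21.5, 24.0, 25.5]
corr = pearson(series1, series2)
print(round(corr, 3))
```

A value near 1 means the two closing prices tend to rise and fall together, a value near -1 means they move in opposite directions, and a value near 0 means there is little linear relationship.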

3173
02:08:02,600 --> 02:08:06,200
So you can notice this is how we
can execute everything now.

3174
02:08:06,200 --> 02:08:10,353
Let's go to our VM and see
everything in our execution.

3175
02:08:11,142 --> 02:08:12,757
Question from at all.

3176
02:08:12,900 --> 02:08:15,300
So this VM how we
will be getting you

3177
02:08:15,300 --> 02:08:17,659
will be getting all
this VM from Edureka.

3178
02:08:17,659 --> 02:08:19,815
So you need not worry
about all that but

3179
02:08:19,815 --> 02:08:21,930
how you will be
getting all this VM.

3180
02:08:21,930 --> 02:08:24,100
So once you
enroll for the courses,

3181
02:08:24,100 --> 02:08:27,300
you will also be getting all
this VM from the Edureka side.

3182
02:08:27,300 --> 02:08:28,541
so even if I am working

3183
02:08:28,541 --> 02:08:30,711
on the Mac operating system,
will my VM work?

3184
02:08:30,711 --> 02:08:32,300
Yes every operating system.

3185
02:08:32,300 --> 02:08:33,535
It will be supported.

3186
02:08:33,535 --> 02:08:35,592
So no trouble you
can just use any sort

3187
02:08:35,592 --> 02:08:38,428
of VM, I mean
any operating system, to do that.

3188
02:08:38,428 --> 02:08:41,000
So what Edureka does
is, they just don't want

3189
02:08:41,000 --> 02:08:43,900
you to be troubled
by any sort of stuff here.

3190
02:08:43,900 --> 02:08:46,076
So what they do is
they kind of ensure

3191
02:08:46,076 --> 02:08:48,342
that whatever is required
for your practicals.

3192
02:08:48,342 --> 02:08:49,400
They take care of it.

3193
02:08:49,400 --> 02:08:51,700
That's the reason they
have created their own VM,

3194
02:08:51,700 --> 02:08:54,600
which is also going to be
smaller in size in comparison

3195
02:08:54,600 --> 02:08:56,100
to the Cloudera or Hortonworks VM,

3196
02:08:56,100 --> 02:08:58,997
and this is going to definitely
be more helpful for you.

3197
02:08:58,997 --> 02:09:01,000
So all these things
will be provided to

3198
02:09:01,000 --> 02:09:02,524
you. A question from another learner:

3199
02:09:02,524 --> 02:09:05,900
So all these projects, am I going
to learn them from the sessions?

3200
02:09:05,900 --> 02:09:06,200
Yes.

3201
02:09:06,200 --> 02:09:09,650
So once you enroll for so
right now whatever we have seen

3202
02:09:09,650 --> 02:09:13,100
definitely we have just gotten
a high-level view of

3203
02:09:13,100 --> 02:09:15,350
how the session looks
for Apache Spark.

3204
02:09:15,350 --> 02:09:18,700
But when we actually teach
all these things in the course,

3205
02:09:18,700 --> 02:09:21,587
it is usually in a much more
detailed format.

3206
02:09:21,587 --> 02:09:22,700
So in detail format,

3207
02:09:22,700 --> 02:09:25,300
we kind of keep on showing
you each step in detail

3208
02:09:25,300 --> 02:09:28,299
how things are working,
even including the project.

3209
02:09:28,299 --> 02:09:30,900
So you will be also learning
with the help of project

3210
02:09:30,900 --> 02:09:32,157
on each different topic.

3211
02:09:32,157 --> 02:09:34,200
So that is the way
we kind of go for it.

3212
02:09:34,200 --> 02:09:36,605
Now if I am stuck
in any other project then

3213
02:09:36,605 --> 02:09:37,985
who will be helping me

3214
02:09:37,985 --> 02:09:40,308
So there will be
a support team, 24 by 7.

3215
02:09:40,308 --> 02:09:42,046
If you get stuck at any moment,

3216
02:09:42,046 --> 02:09:44,300
you just need to
get in touch,

3217
02:09:44,300 --> 02:09:45,900
with a call or an email.

3218
02:09:45,900 --> 02:09:49,076
There is a support ticket
and immediately the technical

3219
02:09:49,076 --> 02:09:52,100
team will be helping you.
The support team is 24 by 7.

3220
02:09:52,100 --> 02:09:53,900
They are
all technical people,

3221
02:09:53,900 --> 02:09:55,821
and they will be assisting
you with all of

3222
02:09:55,821 --> 02:09:58,100
that. Even the trainers
will be assisting you with any

3223
02:09:58,100 --> 02:10:00,000
technical query. Great.

3224
02:10:00,000 --> 02:10:00,400
Awesome.

3225
02:10:00,800 --> 02:10:01,900
Thank you. Now,

3226
02:10:01,900 --> 02:10:03,700
So if you notice this is my data

3227
02:10:03,700 --> 02:10:06,446
we were executing
all the things on this data.

3228
02:10:06,446 --> 02:10:08,726
Now what we want to do
if you notice this is

3229
02:10:08,726 --> 02:10:10,900
the same code which I
have just shown you.

3230
02:10:10,900 --> 02:10:13,800
earlier. Now let us
just execute this code.

3231
02:10:13,800 --> 02:10:15,481
So in order to execute this

3232
02:10:15,481 --> 02:10:18,345
what we can do is connect
to my Spark shell.

3233
02:10:18,345 --> 02:10:20,400
So let's get
connected to the Spark shell.

3234
02:10:21,700 --> 02:10:23,970
Once we are connected
to the Spark shell,

3235
02:10:23,970 --> 02:10:25,382
we will go step by step.

3236
02:10:25,382 --> 02:10:27,700
So first we will be
importing our package.

3237
02:10:31,400 --> 02:10:34,861
This takes some time; let
it just get connected.

3238
02:10:36,300 --> 02:10:38,400
Once this is connected now,

3239
02:10:38,400 --> 02:10:39,400
you can notice

3240
02:10:39,400 --> 02:10:42,400
that I'm just importing
all the important libraries;

3241
02:10:42,400 --> 02:10:44,400
we have already
learned about that.

3242
02:10:45,800 --> 02:10:49,137
After that, you will be
initialising your spark session.

3243
02:10:49,137 --> 02:10:49,805
So let's do

3244
02:10:49,805 --> 02:10:52,900
that again, the same steps
that you have done before.

3245
02:10:58,600 --> 02:10:59,922
Once we are done,

3246
02:10:59,922 --> 02:11:02,000
we will be creating
a Stock class.

3247
02:11:07,000 --> 02:11:09,900
We could have also directly
executed from Eclipse.

3248
02:11:09,900 --> 02:11:11,400
It is just that I want

3249
02:11:11,400 --> 02:11:13,800
to show you step-by-step
whatever we have learnt.

3250
02:11:13,800 --> 02:11:15,700
So now you can see
for company one and then

3251
02:11:15,700 --> 02:11:16,700
if you want to do

3252
02:11:16,700 --> 02:11:20,000
some computation we want to even
see the values and all right,

3253
02:11:20,000 --> 02:11:21,600
so that's what we're doing here.

3254
02:11:21,700 --> 02:11:24,700
So we are just getting
the files and creating another RDD,

3255
02:11:24,700 --> 02:11:26,800
you know, so let's execute this.

3256
02:11:28,500 --> 02:11:31,200
Similarly for your ABAX,
similarly for your FAST,

3257
02:11:31,200 --> 02:11:34,050
for all of these. So I'm just copying
all these things together

3258
02:11:34,050 --> 02:11:36,100
because there are a lot
of companies for which we

3259
02:11:36,100 --> 02:11:37,400
have to do all these steps.

3260
02:11:37,400 --> 02:11:39,625
So let's bring it
for all the 10 companies

3261
02:11:39,625 --> 02:11:41,200
which we are going to create.

3262
02:11:49,000 --> 02:11:49,900
So as you can see,

3263
02:11:49,900 --> 02:11:52,400
this printSchema is giving
its output right now.

3264
02:11:52,400 --> 02:11:52,900
Similarly,

3265
02:11:52,900 --> 02:11:55,800
I can execute the rest
of the things as well.

3266
02:11:55,800 --> 02:11:57,800
So this is giving you
the output in a similar way.

3267
02:11:57,800 --> 02:12:01,702
All the outputs will be shown
up here: company 4, company 5,

3268
02:12:01,702 --> 02:12:05,000
all these companies you
can see this in execution.

3269
02:12:08,000 --> 02:12:11,000
After that, we will be creating
our temporary view

3270
02:12:11,000 --> 02:12:13,800
so that we can execute
our SQL queries.

3271
02:12:16,500 --> 02:12:19,700
So let's do it for the companies
as well, and then after that we

3272
02:12:19,700 --> 02:12:22,900
can just create
our temporary table for it.

3273
02:12:22,900 --> 02:12:25,200
Once we are done now
we can do our queries.

3274
02:12:25,200 --> 02:12:27,357
Like let's say we
can display the average

3275
02:12:27,357 --> 02:12:30,000
of the adjusted closing price
for AAON for each month,

3276
02:12:30,000 --> 02:12:31,400
so we can hit this query.

3277
02:12:34,700 --> 02:12:37,500
So all these queries will happen
on your temporary view

3278
02:12:37,600 --> 02:12:39,800
because we cannot anyway
run all these queries

3279
02:12:39,800 --> 02:12:41,471
on our DataFrames directly.

3280
02:12:41,471 --> 02:12:44,300
So you can see this
is getting executed.

3281
02:12:45,500 --> 02:12:49,200
It is printing the output as well now,
because we have called .show.

3282
02:12:49,200 --> 02:12:51,237
That's the reason
you're getting this output.

3283
02:12:51,237 --> 02:12:51,700
Similarly,

3284
02:12:51,700 --> 02:12:55,600
if we want to, let's say, list
the closing price for MSFT

3285
02:12:55,600 --> 02:12:57,600
which went up by more than $2,

3286
02:12:57,600 --> 02:12:58,794
So that query also we

3287
02:12:58,794 --> 02:13:02,500
can execute. Now, we have already
understood this query in detail.

3288
02:13:03,100 --> 02:13:05,300
Here we are just seeing its
execution part,

3289
02:13:05,500 --> 02:13:08,100
so that you can appreciate
whatever you have learned.

3290
02:13:08,300 --> 02:13:10,700
See this is the output
showing up to you.

3291
02:13:10,800 --> 02:13:12,300
Now after that

3292
02:13:12,300 --> 02:13:15,723
how you can join all the stock
closing prices, right? In a similar way,

3293
02:13:15,723 --> 02:13:18,966
how we can save the joined view
in a Parquet table.

3294
02:13:18,966 --> 02:13:20,435
You want to read that back.

3295
02:13:20,435 --> 02:13:22,157
You want to create a new table

3296
02:13:22,157 --> 02:13:25,275
like so let's execute all
these three queries together

3297
02:13:25,275 --> 02:13:27,100
because we have
already seen this.

3298
02:13:29,700 --> 02:13:30,502
Look at this.

3299
02:13:30,502 --> 02:13:31,800
So in this case,

3300
02:13:31,800 --> 02:13:34,300
we are doing the join and
displaying this output.

3301
02:13:34,300 --> 02:13:36,499
Then we want to save it
in Parquet files.

3302
02:13:36,499 --> 02:13:39,100
We are saving it, and we want
to read it back again.

3303
02:13:39,100 --> 02:13:40,893
Then we are creating
our new table, right?

3304
02:13:40,893 --> 02:13:42,043
We were doing that join

3305
02:13:42,043 --> 02:13:44,200
and all, so that is
what we are doing in this case.

3306
02:13:44,200 --> 02:13:45,900
Then you want
to see this output.

3307
02:13:47,700 --> 02:13:50,400
Then we are again storing it
as a temp table and all.

3308
02:13:50,499 --> 02:13:50,700
Now.

3309
02:13:50,700 --> 02:13:53,700
Once we are done with this step
as well, then what? So we

3310
02:13:53,700 --> 02:13:55,400
have done it in Step 6.

3311
02:13:55,400 --> 02:13:56,900
Now we want to perform,

3312
02:13:56,900 --> 02:13:58,488
let's say, a transformation

3313
02:13:58,488 --> 02:14:01,000
on the new table corresponding
to the three companies

3314
02:14:01,000 --> 02:14:03,411
so that we can compare them.
We want to create

3315
02:14:03,411 --> 02:14:06,305
the best company containing
the best average closing price

3316
02:14:06,305 --> 02:14:07,748
for all these three companies.

3317
02:14:07,748 --> 02:14:09,300
We want to find the companies

3318
02:14:09,300 --> 02:14:11,600
with the best closing
price average per year.

3319
02:14:11,600 --> 02:14:13,200
So let's do all that as well.
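
To make the "best average closing price per year" idea concrete, here is a small plain-Python sketch of the same group-and-compare logic. It is only an illustration under assumptions: the session runs this as SQL on Spark temp tables, and the company names, rows and the `best_company_per_year` helper below are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (company, year, closing_price); values are made up
rows = [
    ("A", 2015, 22.0), ("A", 2015, 24.0),
    ("B", 2015, 50.0), ("B", 2015, 54.0),
    ("C", 2015, 40.0), ("C", 2015, 42.0),
    ("A", 2016, 30.0), ("B", 2016, 28.0), ("C", 2016, 29.0),
]


def best_company_per_year(rows):
    """Return {year: (company, average_close)} for the top company each year."""
    # Accumulate [sum, count] per (year, company) pair
    sums = defaultdict(lambda: [0.0, 0])
    for company, year, close in rows:
        acc = sums[(year, company)]
        acc[0] += close
        acc[1] += 1
    # Keep the company with the highest average for each year
    best = {}
    for (year, company), (total, count) in sums.items():
        avg = total / count
        if year not in best or avg > best[year][1]:
            best[year] = (company, avg)
    return best


result = best_company_per_year(rows)
print(result)
```

This mirrors a GROUP BY on year and company followed by picking the maximum average per year.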

3320
02:14:18,800 --> 02:14:22,343
So you can see best company
of the year now here also

3321
02:14:22,343 --> 02:14:26,500
we are doing the same stuff:
registering our temp table.

3322
02:14:34,100 --> 02:14:35,700
Okay, so there's a mistake here.

3323
02:14:35,700 --> 02:14:38,096
So if you notice here it is 1

3324
02:14:38,100 --> 02:14:40,722
but here we are doing
a show of all right,

3325
02:14:40,722 --> 02:14:42,129
so there is a mistake.

3326
02:14:42,129 --> 02:14:43,600
I'm just correcting it.

3327
02:14:45,000 --> 02:14:48,300
So here also it should be
1. I'm just updating

3328
02:14:48,300 --> 02:14:51,300
in the sheet itself so
that it will start working now.

3329
02:14:51,300 --> 02:14:53,102
So here I have just made it one.

3330
02:14:53,102 --> 02:14:55,300
So now after that it
will start working.

3331
02:14:55,300 --> 02:14:59,600
Okay, wherever it is going
to be all I have to make it one.

3332
02:15:00,400 --> 02:15:03,500
So that is the change
which I need to do here also.

3333
02:15:04,400 --> 02:15:06,700
And you will notice
it will start working.

3334
02:15:06,900 --> 02:15:09,433
So here also you
need to make it one.

3335
02:15:09,433 --> 02:15:10,748
So all those places

3336
02:15:10,748 --> 02:15:14,363
wherever it was. So, just
kind of a good point to make:

3337
02:15:14,363 --> 02:15:18,388
so wherever you are working
on this we need to always ensure

3338
02:15:18,388 --> 02:15:21,800
that all these values
what you are putting up here.

3339
02:15:21,800 --> 02:15:25,900
Okay, so I could have also
done it like this. One second.

3340
02:15:26,300 --> 02:15:27,876
In fact in this place.

3341
02:15:27,876 --> 02:15:30,600
I need not do all
this step. One second.

3342
02:15:30,600 --> 02:15:33,842
Let me also explain to you
why. Now, in this place,

3343
02:15:33,842 --> 02:15:37,600
see, from here
this error started appearing. Why?

3344
02:15:37,600 --> 02:15:38,758
Because my DataFrame,

3345
02:15:38,758 --> 02:15:40,500
the one I created
here, was named with 1.

3346
02:15:40,500 --> 02:15:41,500
Let's execute it.

3347
02:15:41,500 --> 02:15:43,500
Now, you will notice
this will start working.

3348
02:15:44,340 --> 02:15:45,659
See this is working.

3349
02:15:46,000 --> 02:15:46,300
Now.

3350
02:15:46,300 --> 02:15:47,000
After that.

3351
02:15:47,000 --> 02:15:49,493
I am creating a temp table
that temp table.

3352
02:15:49,493 --> 02:15:52,400
What we are creating is,
let's say, companyAll. Okay.

3353
02:15:52,400 --> 02:15:55,100
So this is the temp table
which we have created.

3354
02:15:55,100 --> 02:15:57,808
You can see this companyAll.
Now in this case,

3355
02:15:57,808 --> 02:16:01,300
if I keep this companyAll
as it is, it is going to work,

3356
02:16:02,000 --> 02:16:03,195
Because here anyway,

3357
02:16:03,195 --> 02:16:05,897
I'm going to use
whatever temporary table

3358
02:16:05,897 --> 02:16:07,310
we have created, right?

3359
02:16:07,310 --> 02:16:08,600
So now let's execute.

3360
02:16:10,800 --> 02:16:12,700
So you can see now
it has started working.

3361
02:16:14,000 --> 02:16:15,900
Now, further to that,

3362
02:16:15,900 --> 02:16:18,500
we want to create
a correlation between them

3363
02:16:18,500 --> 02:16:19,600
so we can do that.

3364
02:16:23,700 --> 02:16:26,400
See this is going to give
me the correlation

3365
02:16:26,400 --> 02:16:30,500
between the two columns,
and that we can see here.

3366
02:16:30,700 --> 02:16:34,445
So this is the correlation. The
closer it is to 1, the

3367
02:16:34,445 --> 02:16:37,950
stronger it is. Here it is
definitely near 1; it is 0.9,

3368
02:16:37,950 --> 02:16:39,400
which is a high value.

3369
02:16:39,400 --> 02:16:42,700
So definitely
they both are

3370
02:16:42,700 --> 02:16:45,700
highly correlated, meaning
their stock prices tend

3371
02:16:45,700 --> 02:16:47,300
to move together.

3372
02:16:47,400 --> 02:16:49,700
So this is all about the project.

3373
02:16:49,700 --> 02:16:58,500
Welcome to this interesting
session on Spark Streaming

3374
02:16:58,673 --> 02:16:59,826
from Edureka.

3375
02:17:00,800 --> 02:17:02,261
What is Spark Streaming?

3376
02:17:02,261 --> 02:17:04,415
Is it like really important?

3377
02:17:04,500 --> 02:17:05,400
Definitely?

3378
02:17:05,400 --> 02:17:05,704
Yes.

3379
02:17:05,704 --> 02:17:07,001
Is it really hot?

3380
02:17:07,001 --> 02:17:07,600
Definitely?

3381
02:17:07,600 --> 02:17:08,100
Yes.

3382
02:17:08,100 --> 02:17:10,900
That's the reason we
are learning this technology.

3383
02:17:10,900 --> 02:17:14,600
And this is one of the most
sought-after things in the market.

3384
02:17:14,600 --> 02:17:16,272
When I say it's a hot thing,

3385
02:17:16,272 --> 02:17:18,750
I'm talking in terms
of the job market.

3386
02:17:18,750 --> 02:17:21,600
So let's see what will be
our agenda for today.

3387
02:17:21,900 --> 02:17:25,500
So we are going to discuss
the Spark ecosystem,

3388
02:17:25,500 --> 02:17:27,900
where we are going
to see, okay,

3389
02:17:27,900 --> 02:17:28,700
what is Spark,

3390
02:17:28,700 --> 02:17:32,100
how Spark Streaming fits
into the Spark ecosystem,

3391
02:17:32,100 --> 02:17:35,631
and why Spark Streaming. We
are going to have an overview

3392
02:17:35,631 --> 02:17:39,900
of Spark Streaming, kind of
getting into the basics of that.

3393
02:17:39,900 --> 02:17:41,832
We will learn about DStreams.

3394
02:17:41,832 --> 02:17:44,890
We will also learn about
DStream transformations.

3395
02:17:44,890 --> 02:17:46,800
We will be
learning about caching

3396
02:17:46,800 --> 02:17:51,200
and persistence, accumulators,
broadcast variables, checkpoints.

3397
02:17:51,200 --> 02:17:53,600
These are the advanced
concepts of Spark.

3398
02:17:54,100 --> 02:17:55,600
And then in the end,

3399
02:17:55,600 --> 02:17:59,900
we will walk through a use case
of Twitter sentiment analysis.

3400
02:18:00,500 --> 02:18:04,700
Now, what is streaming?
Let's understand that.

3401
02:18:04,800 --> 02:18:08,000
So let me start
by giving you an example.

3402
02:18:08,600 --> 02:18:12,300
So let's say there is
a bank. With a bank,

3403
02:18:12,500 --> 02:18:13,082
definitely,

3404
02:18:13,082 --> 02:18:14,200
I'm pretty sure all

3405
02:18:14,200 --> 02:18:18,700
of you must have used credit
cards, debit cards, all those cards

3406
02:18:18,700 --> 02:18:20,900
that banks provide. Now,

3407
02:18:20,900 --> 02:18:23,500
let's say you
have done a transaction

3408
02:18:23,500 --> 02:18:27,300
from India just now,
and within an hour

3409
02:18:27,300 --> 02:18:30,260
your card
is getting swiped in the US.

3410
02:18:30,260 --> 02:18:31,600
Is it even possible

3411
02:18:31,600 --> 02:18:35,801
for your card to reach the US
in an hour? Definitely not. Now,

3412
02:18:35,900 --> 02:18:38,100
how will the bank realize

3413
02:18:38,700 --> 02:18:41,000
that it is a fraudulent transaction?

3414
02:18:41,000 --> 02:18:44,600
Because the bank cannot let
that transaction happen.

3415
02:18:44,700 --> 02:18:46,238
They need to stop it

3416
02:18:46,238 --> 02:18:49,771
at the time when it
is getting swiped. Either

3417
02:18:49,771 --> 02:18:51,000
they can block it,

3418
02:18:51,000 --> 02:18:52,800
give a call to you and ask you

3419
02:18:52,800 --> 02:18:55,394
whether it is a genuine
transaction or not,

3420
02:18:55,394 --> 02:18:57,000
or do something of that sort.

3421
02:18:57,692 --> 02:18:58,000
Now.

3422
02:18:58,000 --> 02:19:00,300
Do you think they will put
some manual person

3423
02:19:00,300 --> 02:19:01,127
behind the scene

3424
02:19:01,127 --> 02:19:03,300
that will be looking
at all the transaction

3425
02:19:03,300 --> 02:19:05,100
and will block it manually?

3426
02:19:05,100 --> 02:19:08,315
No, so they require
something of the sort

3427
02:19:08,315 --> 02:19:11,100
where the data will
be getting streamed,

3428
02:19:11,100 --> 02:19:12,500
and in real time

3429
02:19:12,500 --> 02:19:16,113
they should be able to catch it
with the help of some pattern.

3430
02:19:16,113 --> 02:19:17,851
They will do some processing

3431
02:19:17,851 --> 02:19:20,575
and they will get
some pattern out of it, and

3432
02:19:20,575 --> 02:19:23,305
if it is not looking
like a genuine transaction,

3433
02:19:23,305 --> 02:19:26,649
they will immediately
block it, give you a call,

3434
02:19:26,649 --> 02:19:28,565
or maybe send you an OTP to confirm

3435
02:19:28,565 --> 02:19:31,100
whether it's a genuine
transaction or not. They

3436
02:19:31,100 --> 02:19:32,050
will not wait

3437
02:19:32,050 --> 02:19:36,000
till the next day to kind of
complete that transaction.

3438
02:19:36,000 --> 02:19:38,941
Otherwise, if that happened,
nobody is going to touch

3439
02:19:38,941 --> 02:19:40,000
that bank, right?

3440
02:19:40,000 --> 02:19:43,000
So that is how we
work with streaming.

3441
02:19:43,100 --> 02:19:46,300
Now, someone has mentioned

3442
02:19:46,500 --> 02:19:51,400
that without stream processing,
big data is not even possible.

3443
02:19:51,400 --> 02:19:52,435
In fact, we can say

3444
02:19:52,435 --> 02:19:55,200
that there is no big data
which is possible;

3445
02:19:55,200 --> 02:19:57,900
we cannot even talk
about the Internet of Things.

3446
02:19:57,900 --> 02:20:00,800
Right? And this is
a very famous statement

3447
02:20:00,800 --> 02:20:01,900
from Dana Sandu

3448
02:20:01,900 --> 02:20:05,600
of SQLstream.
A lot of companies

3449
02:20:05,700 --> 02:20:13,500
like YouTube, Netflix, Facebook,
Twitter, iTunes and Pandora.

3450
02:20:13,769 --> 02:20:17,230
All these companies
are using streaming.

3451
02:20:17,700 --> 02:20:18,100
Now.

3452
02:20:19,100 --> 02:20:20,400
What is this?

3453
02:20:20,400 --> 02:20:23,580
We have just seen an
example and kind of got an

3454
02:20:23,580 --> 02:20:25,000
idea about streaming.

3455
02:20:25,100 --> 02:20:30,300
Now, as I said, with time
and with the internet growing,

3456
02:20:30,453 --> 02:20:35,146
these streaming technologies
are becoming popular day by day.

3457
02:20:35,500 --> 02:20:39,300
Streaming is a technique
to transfer the data

3458
02:20:39,500 --> 02:20:45,000
so that it can be processed
as a steady and continuous

3459
02:20:45,000 --> 02:20:47,000
stream; it means that immediately,

3460
02:20:47,000 --> 02:20:49,500
as and when the data is coming

3461
02:20:49,600 --> 02:20:52,900
you are continuously
processing it as well.

3462
02:20:53,600 --> 02:20:54,400
In fact,

3463
02:20:54,400 --> 02:20:58,938
this real-time streaming is
what is driving big data

3464
02:20:59,100 --> 02:21:02,000
and also the Internet of Things. Now,

3465
02:21:02,000 --> 02:21:04,786
there will be a lot of things,
like the fundamental unit

3466
02:21:04,786 --> 02:21:06,387
of streaming, the DStream.

3467
02:21:06,387 --> 02:21:08,700
We will also be
transforming our streams.

3468
02:21:08,700 --> 02:21:09,700
We will be doing it.

3469
02:21:09,700 --> 02:21:10,994
In fact, the companies

3470
02:21:10,994 --> 02:21:13,400
are using it with
their business intelligence.

3471
02:21:13,400 --> 02:21:16,200
We will see more details
in the upcoming slides.

3472
02:21:16,300 --> 02:21:20,900
But before that we will be
talking about spark ecosystem

3473
02:21:21,200 --> 02:21:23,500
When we talk about Spark,

3474
02:21:23,500 --> 02:21:25,653
there are multiple libraries

3475
02:21:25,653 --> 02:21:29,565
which are present. The first one
is Spark SQL. Now,

3476
02:21:29,565 --> 02:21:31,100
Spark SQL is like this:

3477
02:21:31,100 --> 02:21:35,000
a SQL developer
can write the query in the SQL way,

3478
02:21:35,000 --> 02:21:38,600
and it is going to get converted
into the Spark way

3479
02:21:38,600 --> 02:21:42,828
and then give you the output,
kind of analogous to Hive,

3480
02:21:42,828 --> 02:21:46,400
but it is going to be faster
in comparison to Hive.

3481
02:21:46,400 --> 02:21:48,252
When we talk about Spark Streaming,

3482
02:21:48,252 --> 02:21:50,900
that is what we are going
to learn. It is going

3483
02:21:50,900 --> 02:21:55,300
to enable all the analytical
and interactive applications

3484
02:21:55,600 --> 02:21:59,400
for your live
streaming data. Next, MLlib.

3485
02:21:59,700 --> 02:22:02,400
MLlib is mostly
for machine learning.

3486
02:22:02,400 --> 02:22:03,546
And in fact,

3487
02:22:03,546 --> 02:22:06,007
the interesting part
about MLlib is

3488
02:22:06,200 --> 02:22:11,100
that it is completely replacing
Mahout; it has almost replaced it.

3489
02:22:11,100 --> 02:22:13,500
Now all the core contributors

3490
02:22:13,500 --> 02:22:17,700
of Mahout have moved

3491
02:22:18,184 --> 02:22:19,800
towards MLlib

3492
02:22:19,800 --> 02:22:23,500
because of the faster response;
the performance is really good

3493
02:22:23,500 --> 02:22:26,707
in MLlib. Next, GraphX.

3494
02:22:26,707 --> 02:22:27,005
Okay.

3495
02:22:27,005 --> 02:22:29,794
Let me give you an example.
Everybody must have used

3496
02:22:29,794 --> 02:22:31,100
Google Maps, right? Now,

3497
02:22:31,100 --> 02:22:34,082
what do you do in Google Maps?
You search for the path.

3498
02:22:34,082 --> 02:22:36,600
You put your source, you
put your destination.

3499
02:22:36,600 --> 02:22:38,900
Now when you just
search for the path,

3500
02:22:39,000 --> 02:22:40,500
it searches different paths

3501
02:22:40,800 --> 02:22:45,100
and then provides you
an optimal path, right? Now,

3502
02:22:45,300 --> 02:22:47,300
how is it providing
the optimal path?

3503
02:22:47,300 --> 02:22:50,500
These things can be done
with the help of GraphX.

3504
02:22:50,500 --> 02:22:53,500
So wherever you can create
a kind of graph structure,

3505
02:22:53,500 --> 02:22:54,500
we will say

3506
02:22:54,500 --> 02:22:56,997
that we can use
GraphX. Then SparkR.

3507
02:22:56,997 --> 02:22:57,300
Now.

3508
02:22:57,300 --> 02:23:00,600
This is the kind
of package provided for R.

3509
02:23:00,600 --> 02:23:02,538
R is open source,

3510
02:23:02,538 --> 02:23:05,000
which is mostly used by analysts,

3511
02:23:05,000 --> 02:23:08,300
and now the Spark community
wants, in fact, all

3512
02:23:08,300 --> 02:23:11,594
the analysts to kind of move
towards the Spark environment.

3513
02:23:11,594 --> 02:23:12,900
And that's the reason

3514
02:23:12,900 --> 02:23:15,615
they have recently
started supporting SparkR,

3515
02:23:15,615 --> 02:23:17,226
where all the analysts

3516
02:23:17,226 --> 02:23:20,301
can now execute their queries
using the Spark environment,

3517
02:23:20,301 --> 02:23:22,800
thus getting better
performance, and we

3518
02:23:22,800 --> 02:23:25,000
can also work on Big Data.

3519
02:23:25,200 --> 02:23:27,800
That's all
about the ecosystem. Now,

3520
02:23:27,800 --> 02:23:31,061
below this we are going to have
a core engine. The core engine

3521
02:23:31,061 --> 02:23:34,500
is the one which defines all
the basics of Spark;

3522
02:23:34,500 --> 02:23:36,363
all the RDD-related stuff

3523
02:23:36,363 --> 02:23:38,600
and all is going to be defined

3524
02:23:38,600 --> 02:23:43,300
in your Spark core engine.
Moving further now,

3525
02:23:43,300 --> 02:23:46,227
so as we have just
discussed this part, we

3526
02:23:46,227 --> 02:23:49,767
are now going to discuss
Spark Streaming in detail,

3527
02:23:49,767 --> 02:23:53,500
which is going to enable
analytical and interactive

3528
02:23:53,600 --> 02:23:58,300
applications for live streaming data.
Now, why Spark Streaming?

3529
02:23:58,800 --> 02:24:01,400
If I talk about why
Spark Streaming, definitely

3530
02:24:01,400 --> 02:24:04,230
we have just got an idea
that it is very important.

3531
02:24:04,230 --> 02:24:06,100
That's the reason
we are learning it

3532
02:24:06,200 --> 02:24:09,804
but this is so powerful
that it is used now

3533
02:24:09,804 --> 02:24:14,169
by a lot of companies
to perform their marketing. They

3534
02:24:14,169 --> 02:24:15,900
kind of get an idea

3535
02:24:15,900 --> 02:24:18,250
of what a customer
is looking for.

3536
02:24:18,250 --> 02:24:22,094
In fact, we are going to learn
a use case similar to that,

3537
02:24:22,094 --> 02:24:24,700
where we are going
to use Spark Streaming:

3538
02:24:24,700 --> 02:24:28,283
we are going to do
Twitter sentiment analysis,

3539
02:24:28,283 --> 02:24:31,100
which can be used
for your crisis management.

3540
02:24:31,100 --> 02:24:33,680
Maybe you want to check
all your products,

3541
02:24:33,680 --> 02:24:35,100
or do service adjusting,

3542
02:24:35,100 --> 02:24:37,420
or target marketing.

3543
02:24:37,500 --> 02:24:40,342
by all the companies
around the world.

3544
02:24:40,342 --> 02:24:42,800
This is getting used
in this way.

3545
02:24:42,817 --> 02:24:46,355
And that's the reason
Spark Streaming is gaining

3546
02:24:46,355 --> 02:24:50,432
popularity, and also because
of its performance.

3547
02:24:50,600 --> 02:24:53,200
It is beating
other platforms

3548
02:24:53,600 --> 02:24:57,400
at the moment.
Now, moving further,

3549
02:24:57,600 --> 02:25:01,300
let's see Spark Streaming
features. When we talk

3550
02:25:01,300 --> 02:25:03,300
about Spark Streaming features,

3551
02:25:03,400 --> 02:25:05,100
it's very easy to scale.

3552
02:25:05,100 --> 02:25:07,420
You can scale
to even multiple nodes,

3553
02:25:07,420 --> 02:25:11,083
which can even run to hundreds
of nodes. Speed: it is going

3554
02:25:11,083 --> 02:25:14,000
to be very quick, meaning
in a very short time

3555
02:25:14,000 --> 02:25:17,900
you can stream as well as
process your data. Fault tolerance:

3556
02:25:17,900 --> 02:25:19,300
it makes sure

3557
02:25:19,300 --> 02:25:23,100
that you are not losing
your data. Integration:

3558
02:25:23,100 --> 02:25:26,600
your batch and real-time
processing can be integrated,

3559
02:25:26,600 --> 02:25:30,446
and it can also be used
for your business analytics

3560
02:25:30,500 --> 02:25:34,800
which is used to track
the behavior of your customer.

3561
02:25:34,900 --> 02:25:38,700
So as you can see, this
is super powerful, and it's

3562
02:25:38,700 --> 02:25:43,000
like we are kind of getting to
know so many interesting things

3563
02:25:43,000 --> 02:25:48,000
about Spark Streaming. Now, let's
quickly have an overview

3564
02:25:48,000 --> 02:25:50,900
so that we can get
some basics of Spark Streaming.

3565
02:25:50,900 --> 02:25:53,200
Now, let's understand

3566
02:25:53,200 --> 02:25:54,300
Spark Streaming.

3567
02:25:55,100 --> 02:25:59,200
So as we have just discussed it
is for real-time streaming data.

3568
02:25:59,600 --> 02:26:04,100
It is a useful addition
to your Spark core API.

3569
02:26:04,100 --> 02:26:06,500
So we have already seen
at the base level

3570
02:26:06,500 --> 02:26:07,400
we have that Spark

3571
02:26:07,400 --> 02:26:10,700
core in our ecosystem; on top
of that we have Spark Streaming.

3572
02:26:10,700 --> 02:26:14,700
In fact, Spark Streaming
is kind of adding a lot

3573
02:26:14,700 --> 02:26:18,000
of advantage to the Spark community,

3574
02:26:18,000 --> 02:26:22,349
because a lot of people are joining
the Spark community just to kind

3575
02:26:22,349 --> 02:26:23,800
of use Spark Streaming.

3576
02:26:23,800 --> 02:26:25,000
It's so powerful.

3577
02:26:25,000 --> 02:26:26,344
Everyone wants to come

3578
02:26:26,344 --> 02:26:29,478
and want to use it
because all the other Frameworks

3579
02:26:29,478 --> 02:26:30,809
which we already have

3580
02:26:30,809 --> 02:26:33,469
which are existing are
not as good in terms

3581
02:26:33,469 --> 02:26:34,783
of performance in all

3582
02:26:34,783 --> 02:26:36,311
and the ease

3583
02:26:36,311 --> 02:26:38,482
of using Spark
Streaming is also great.

3584
02:26:38,482 --> 02:26:41,482
If you compare your program
with, let's say, Hadoop or

3585
02:26:41,482 --> 02:26:44,100
Storm, which is used
for real-time processing,

3586
02:26:44,100 --> 02:26:46,356
You will notice
that it is much easier

3587
02:26:46,356 --> 02:26:49,100
from
a developer's point of view.

3588
02:26:49,100 --> 02:26:52,400
That's the reason a lot
of people are showing interest

3589
02:26:52,400 --> 02:26:53,800
in this domain now,

3590
02:26:53,800 --> 02:26:56,800
it also enables scalable,
high-throughput

3591
02:26:56,800 --> 02:26:58,187
and fault-tolerant stream processing,

3592
02:26:58,187 --> 02:27:02,725
so that you can stream your data
and process all of it.

3593
02:27:02,900 --> 02:27:06,900
And the fundamental unit
for Spark Streaming is going

3594
02:27:06,900 --> 02:27:08,200
to be the DStream.

3595
02:27:08,300 --> 02:27:09,700
What is this thing?

3596
02:27:09,700 --> 02:27:10,600
Let me explain it.

3597
02:27:11,100 --> 02:27:14,200
So a DStream is
basically a series

3598
02:27:14,200 --> 02:27:18,900
of RDDs to process
the real-time data.

3599
02:27:19,400 --> 02:27:21,100
What we generally do is

3600
02:27:21,100 --> 02:27:23,678
if you look
at this slide,

3601
02:27:23,678 --> 02:27:25,300
when you get the data,

3602
02:27:25,400 --> 02:27:29,800
it is continuous data. You
divide it into batches

3603
02:27:29,800 --> 02:27:31,200
of input data.

3604
02:27:31,400 --> 02:27:35,700
We are going to call each one
a micro-batch, and then

3605
02:27:35,700 --> 02:27:39,447
we are going to get batches
of processed data. Though

3606
02:27:39,447 --> 02:27:40,600
it is real time,

3607
02:27:40,600 --> 02:27:42,300
still, how come it is batch?

3608
02:27:42,300 --> 02:27:44,547
because definitely you
are doing processing

3609
02:27:44,547 --> 02:27:46,258
on some part of the data, right?

3610
02:27:46,258 --> 02:27:48,300
Even if it is coming
at real time.

3611
02:27:48,300 --> 02:27:52,500
And that is what we are going
to call a micro-batch.
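The micro-batch idea described above can be sketched in a few lines of plain Python. This is only an illustration and not Spark's API: the function name micro_batches, the sample events, and the one-second interval are all assumptions made up for the example.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-width time batches.

    Hypothetical helper for illustration only; Spark Streaming does this
    internally when it cuts a continuous stream into batches.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Batch k holds events with k*interval <= timestamp < (k+1)*interval.
        batches[int(timestamp // batch_interval)].append(value)
    # Return the non-empty batches in time order.
    return [batches[k] for k in sorted(batches)]


# Made-up events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]
print(micro_batches(events, batch_interval=1.0))
```

Each inner list stands for one micro-batch: the data arrives continuously, but processing happens on one small slice of it at a time.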

3612
02:27:53,600 --> 02:27:55,700
Moving further now.

3613
02:27:56,600 --> 02:27:59,100
Let's see a few more
details on it.

3614
02:27:59,223 --> 02:28:02,300
Now from where you
can get all your data.

3615
02:28:02,300 --> 02:28:04,600
What can be your
data sources here.

3616
02:28:04,600 --> 02:28:09,000
So if we talk about data sources
here, we can stream the data

3617
02:28:09,000 --> 02:28:13,700
from multiple sources
like Kafka or Flume events.

3618
02:28:13,700 --> 02:28:16,586
You have databases
like HBase, MongoDB,

3619
02:28:16,586 --> 02:28:20,051
which are, you know,
NoSQL databases, Elasticsearch,

3620
02:28:20,051 --> 02:28:24,600
PostgreSQL, the Parquet file format. You
can get all the data from here.

3621
02:28:24,600 --> 02:28:27,700
Now after that you can also
do processing

3622
02:28:27,700 --> 02:28:29,553
with the help
of machine learning.

3623
02:28:29,553 --> 02:28:32,700
You can do the processing
with the help of your spark SQL

3624
02:28:32,700 --> 02:28:34,800
and then give the output.

3625
02:28:34,900 --> 02:28:37,000
So this is a very strong thing

3626
02:28:37,000 --> 02:28:40,100
that you are bringing
the data using Spark Streaming,

3627
02:28:40,100 --> 02:28:41,964
but processing you can do

3628
02:28:41,964 --> 02:28:44,800
by using some other
Frameworks as well.

3629
02:28:44,800 --> 02:28:47,514
Right like machine learning
you can apply on the data

3630
02:28:47,514 --> 02:28:49,549
that you're getting
in real time.

3631
02:28:49,549 --> 02:28:51,966
You can also apply
your Spark SQL to the data

3632
02:28:51,966 --> 02:28:53,200
which you're getting in

3633
02:28:53,200 --> 02:28:56,300
real time. Moving further.

3634
02:28:57,100 --> 02:29:00,089
So this is a similar thing. Now,
in Spark Streaming,

3635
02:29:00,089 --> 02:29:03,200
you can just get the data
from multiple sources

3636
02:29:03,200 --> 02:29:07,600
like Kafka, Flume,
HDFS, Kinesis, Twitter, bringing it

3637
02:29:07,600 --> 02:29:10,300
to Spark Streaming,
doing the processing,

3638
02:29:10,300 --> 02:29:12,500
and storing it back
to your HDFS.

3639
02:29:12,500 --> 02:29:15,900
Maybe you can bring it to
your DB you can also publish

3640
02:29:15,900 --> 02:29:17,400
to your UI dashboard.

3641
02:29:17,400 --> 02:29:21,402
like Tableau, AngularJS. A lot
of UI dashboards are there

3642
02:29:21,700 --> 02:29:25,100
in which you can publish
your output now.

3643
02:29:25,500 --> 02:29:26,346
How it works:

3644
02:29:26,346 --> 02:29:29,782
let us just break it down
into a more fine-grained picture.

3645
02:29:29,782 --> 02:29:32,700
Now we are going to get
our input data stream.

3646
02:29:32,700 --> 02:29:34,500
We are going to put it inside

3647
02:29:34,500 --> 02:29:38,200
of Spark Streaming and get
the batches of input data.

3648
02:29:38,200 --> 02:29:40,772
Once they execute
in the Spark engine,

3649
02:29:40,772 --> 02:29:44,300
we are going to get the batches
of processed data.

3650
02:29:44,300 --> 02:29:47,146
We have just seen
the same diagram before so

3651
02:29:47,146 --> 02:29:49,000
the same explanation for it.

3652
02:29:49,000 --> 02:29:52,400
Now again breaking it down
into a more granular part.

3653
02:29:52,400 --> 02:29:55,060
We are getting a
DStream. A DStream was

3654
02:29:55,060 --> 02:29:58,800
what? Multiple batches of data,
multiple sets of RDDs.

3655
02:29:58,800 --> 02:30:00,500
So we are getting a DStream.

3656
02:30:00,500 --> 02:30:03,400
So let's say we are getting
an RDD at time 1,

3657
02:30:03,400 --> 02:30:06,200
because now we are getting
real-time streaming data, right?

3658
02:30:06,200 --> 02:30:07,936
So let's say, right now,

3659
02:30:07,936 --> 02:30:08,872
I got one second of data.

3660
02:30:08,872 --> 02:30:11,399
Then in the next
one second

3661
02:30:11,399 --> 02:30:14,600
I got more data, and I got
more data in the one after that.

3662
02:30:14,600 --> 02:30:16,300
So that is what
we're talking about.

3663
02:30:16,300 --> 02:30:17,602
So we are creating data.

3664
02:30:17,602 --> 02:30:20,322
For what we are getting from time
0 to time 1, we say

3665
02:30:20,322 --> 02:30:22,171
that we have an RDD at

3666
02:30:22,171 --> 02:30:24,556
time 1. Similarly,
it keeps proceeding

3667
02:30:24,556 --> 02:30:27,300
as time
proceeds here.

3668
02:30:27,400 --> 02:30:30,683
Now in the next thing
we are extracting the words

3669
02:30:30,683 --> 02:30:32,400
from an input stream. So

3670
02:30:32,400 --> 02:30:33,300
if you can notice

3671
02:30:33,300 --> 02:30:35,550
what we are doing here
from where let's say,

3672
02:30:35,550 --> 02:30:37,700
we started applying
doing our operations

3673
02:30:37,700 --> 02:30:40,419
as we started doing
our any sort of processing.

3674
02:30:40,419 --> 02:30:43,200
So as and when we get the data
in this timeframe,

3675
02:30:43,200 --> 02:30:44,707
we start processing it.

3676
02:30:44,707 --> 02:30:46,307
It can be a flatMap operation.

3677
02:30:46,307 --> 02:30:49,300
It can be any sort of operation
you're doing it can be even

3678
02:30:49,300 --> 02:30:51,800
a machine-learning operation,
whatever you are doing,

3679
02:30:51,800 --> 02:30:55,600
and then you are generating
the words in that kind of thing.

3680
02:30:55,700 --> 02:30:58,700
So this is how,
as we're seeing,

3681
02:30:58,700 --> 02:31:02,700
gradually we can kind
of see how all these parts work.

3682
02:31:02,700 --> 02:31:04,620
First, at a very high level, we saw how this works.

3683
02:31:04,620 --> 02:31:06,738
We again went into
detail then again,

3684
02:31:06,738 --> 02:31:08,249
we went into more detail.

3685
02:31:08,249 --> 02:31:09,700
And finally we have seen

3686
02:31:09,700 --> 02:31:13,600
that how we can even process
the data along the time

3687
02:31:13,600 --> 02:31:16,594
when we are streaming
our data as well.

3688
02:31:17,100 --> 02:31:21,500
Now one important point: just
like the Spark context is

3689
02:31:21,853 --> 02:31:25,700
the main entry point for
any Spark application, similarly,

3690
02:31:25,700 --> 02:31:28,300
to work on Spark

3691
02:31:28,300 --> 02:31:31,600
Streaming you require
a streaming context.

3692
02:31:31,700 --> 02:31:35,800
What is that? When you're passing
your input data stream,

3693
02:31:35,800 --> 02:31:38,400
when you are working
on the Spark engine

3694
02:31:38,400 --> 02:31:41,000
when you're working
on the Spark Streaming engine,

3695
02:31:41,000 --> 02:31:42,900
you have to use your streaming

3696
02:31:42,900 --> 02:31:46,289
context. It is by using
the streaming context only that

3697
02:31:46,289 --> 02:31:48,700
you are going to get the batches

3698
02:31:48,700 --> 02:31:52,300
of your input data. Now,
the streaming context

3699
02:31:52,300 --> 02:31:57,000
is going to consume a stream
of data in Apache Spark.

3700
02:31:57,300 --> 02:31:58,800
It registers

3701
02:31:58,800 --> 02:32:04,000
an input DStream to produce
a receiver object.

3702
02:32:04,500 --> 02:32:08,200
Now it is the main entry point
as we discussed

3703
02:32:08,200 --> 02:32:11,011
that like spark context is
the main entry point

3704
02:32:11,011 --> 02:32:12,600
for the spark application.

3705
02:32:12,600 --> 02:32:13,400
Similarly.

3706
02:32:13,400 --> 02:32:16,110
Your streaming context
is an entry point

3707
02:32:16,110 --> 02:32:17,500
for your Spark Streaming.

3708
02:32:17,500 --> 02:32:20,800
Now does that mean
the Spark context is

3709
02:32:20,800 --> 02:32:22,569
not an entry point? No.

3710
02:32:22,569 --> 02:32:25,779
When you create Spark Streaming,
it is dependent

3711
02:32:25,779 --> 02:32:27,600
on your Spark context.

3712
02:32:27,600 --> 02:32:30,007
So when you create
this streaming context,

3713
02:32:30,007 --> 02:32:33,509
it is going to be dependent
on your Spark context only,

3714
02:32:33,509 --> 02:32:36,732
because you will not be able
to create a streaming context

3715
02:32:36,732 --> 02:32:38,000
without a Spark context.

3716
02:32:38,000 --> 02:32:41,000
So that's the reason it
is definitely required. Spark

3717
02:32:41,000 --> 02:32:45,600
also provides a number of default
implementations of sources,

3718
02:32:45,800 --> 02:32:50,000
like bringing in the data
from Twitter, Akka actors, ZeroMQ,

3719
02:32:50,100 --> 02:32:53,100
which are accessible
from the context.

3720
02:32:53,100 --> 02:32:55,800
So it is supporting
so many things, right?

3721
02:32:55,800 --> 02:32:58,600
Now, if you notice

3722
02:32:58,600 --> 02:33:01,000
what we are doing
with the streaming context,

3723
02:33:01,000 --> 02:33:03,497
this is just to give
you an idea about

3724
02:33:03,497 --> 02:33:06,500
how we can initialize
our streaming context.

3725
02:33:06,500 --> 02:33:09,971
So we will be importing
these two libraries after that.

3726
02:33:09,971 --> 02:33:12,923
Can you see I'm passing
the Spark context sc, right? And I'm

3727
02:33:12,923 --> 02:33:14,400
passing one second as the interval.

3728
02:33:14,400 --> 02:33:17,323
We are collecting the data
means collect the data

3729
02:33:17,323 --> 02:33:18,400
for every 1 second.

3730
02:33:18,400 --> 02:33:21,500
You can increase this number
if you want and then this

3731
02:33:21,500 --> 02:33:24,028
is your ssc. It means
in every one second,

3732
02:33:24,028 --> 02:33:25,482
whatever is going to happen,

3733
02:33:25,482 --> 02:33:27,000
I'm going to process it.

3734
02:33:27,000 --> 02:33:28,800
And what we're doing
in this place,

3735
02:33:28,900 --> 02:33:33,100
let's go to the DStream topic
now. Now, for DStream,

3736
02:33:33,500 --> 02:33:37,000
the full form
is discretized stream.

3737
02:33:37,053 --> 02:33:38,900
It's a basic abstraction

3738
02:33:38,900 --> 02:33:41,679
provided by your Spark
Streaming framework.

3739
02:33:41,679 --> 02:33:46,400
It's a continuous stream of data,
and it is going to be received

3740
02:33:46,400 --> 02:33:47,630
from your source

3741
02:33:47,630 --> 02:33:52,200
or from processed data.
The streaming context is related

3742
02:33:52,200 --> 02:33:56,900
to your Spark Streaming,
and the Spark context belongs

3743
02:33:56,900 --> 02:33:57,974
to your Spark Core.

3744
02:33:57,974 --> 02:34:01,600
If you remember the ecosystem
diagram, in the ecosystem

3745
02:34:01,600 --> 02:34:06,400
we have that Spark context, right?
Now the streaming context is built

3746
02:34:06,400 --> 02:34:08,784
with the help of spark context.

3747
02:34:08,800 --> 02:34:11,800
And in fact using
streaming context only

3748
02:34:11,800 --> 02:34:15,604
you will be able to perform
your Spark Streaming. Just like

3749
02:34:15,604 --> 02:34:17,722
without spark context you will

3750
02:34:17,722 --> 02:34:19,700
not be able to execute anything

3751
02:34:19,700 --> 02:34:22,482
in a Spark application, your
Spark application

3752
02:34:22,482 --> 02:34:25,100
will not be able
to do anything similarly

3753
02:34:25,100 --> 02:34:27,200
without a streaming context,

3754
02:34:27,200 --> 02:34:31,500
your streaming application
will not be able to do anything.

3755
02:34:31,500 --> 02:34:34,838
It's just that the streaming
context is built on top

3756
02:34:34,838 --> 02:34:36,100
of spark context.

3757
02:34:36,500 --> 02:34:39,700
Okay, so now, it's
a continuous stream

3758
02:34:39,700 --> 02:34:42,400
of data when we talk
about a DStream.

3759
02:34:42,400 --> 02:34:46,200
It is received from a source,
or it is the processed data stream

3760
02:34:46,200 --> 02:34:49,000
generated by the
transformation of the input stream.

3761
02:34:49,300 --> 02:34:53,800
If you look at this part,
internally a DStream

3762
02:34:53,800 --> 02:34:57,389
can be represented by
a continuous series of

3763
02:34:57,389 --> 02:34:59,620
RDDs. This is important.

3764
02:34:59,946 --> 02:35:04,400
Now what we're doing is
every second remember last time

3765
02:35:04,400 --> 02:35:05,800
we have just seen an example

3766
02:35:05,900 --> 02:35:08,335
of like every second
whatever is going to happen,

3767
02:35:08,335 --> 02:35:10,100
We are going to do processing.

3768
02:35:10,200 --> 02:35:13,700
So in that every second
whatever data you

3769
02:35:13,700 --> 02:35:17,300
are collecting and you're
performing your operation.

3770
02:35:17,300 --> 02:35:18,010
So the data

3771
02:35:18,010 --> 02:35:21,500
that you're getting here
will be your DStream. It means

3772
02:35:21,500 --> 02:35:23,129
it's continuous. You can say

3773
02:35:23,129 --> 02:35:26,200
that all these things
will be your DStream.

3774
02:35:26,200 --> 02:35:29,800
It's represented
by a continuous series

3775
02:35:29,800 --> 02:35:32,300
of RDDs, so
many RDDs are getting created,

3776
02:35:32,300 --> 02:35:34,500
because, let's say, right
now in one second,

3777
02:35:34,500 --> 02:35:36,000
whatever data got collected,

3778
02:35:36,000 --> 02:35:37,100
I executed it.

3779
02:35:37,100 --> 02:35:40,500
In the second second,
this is happening here.

3780
02:35:40,715 --> 02:35:41,100
Okay?

3781
02:35:41,100 --> 02:35:41,800
Okay.

3782
02:35:41,800 --> 02:35:42,700
Sorry for that.

3783
02:35:42,700 --> 02:35:46,300
Now in the second second
also it is happening. In

3784
02:35:46,300 --> 02:35:47,400
the third second

3785
02:35:47,400 --> 02:35:49,000
also it is happening here.

3786
02:35:49,700 --> 02:35:50,500
No problem.

3787
02:35:50,500 --> 02:35:53,100
No, I'm not going
to do it now fine.

3788
02:35:53,100 --> 02:35:54,727
So in the third second also,

3789
02:35:54,727 --> 02:35:57,200
if I did something
I'm processing it here.

3790
02:35:57,200 --> 02:35:57,500
Right.

3791
02:35:57,500 --> 02:35:59,800
So if you see
that this diagram itself,

3792
02:35:59,800 --> 02:36:03,600
so it is every second whatever
data is getting collected.

3793
02:36:03,600 --> 02:36:05,400
We are doing the processing

3794
02:36:05,400 --> 02:36:09,250
on top of it, and the whole
continuous series of RDDs

3795
02:36:09,250 --> 02:36:13,100
that we are seeing here
will be called a DStream.

3796
02:36:13,100 --> 02:36:13,500
Okay.

3797
02:36:13,500 --> 02:36:18,100
So this is what your DStream is.
Moving further now,

3798
02:36:18,600 --> 02:36:22,300
we are going to understand
the operations on DStreams.

3799
02:36:22,300 --> 02:36:24,500
So let's say you are doing

3800
02:36:24,500 --> 02:36:27,300
this operation on the DStream:
you are getting

3801
02:36:27,300 --> 02:36:30,000
the data from time 0 to 1. Again,

3802
02:36:30,000 --> 02:36:32,300
you are applying some operation

3803
02:36:32,300 --> 02:36:36,108
on that then whatever output
you get you're going to call

3804
02:36:36,108 --> 02:36:39,200
it the words DStream.
It means this is the thing

3805
02:36:39,200 --> 02:36:41,166
you're doing: you're doing
a flatMap operation.

3806
02:36:41,166 --> 02:36:42,700
That's the reason
we're calling it the

3807
02:36:42,700 --> 02:36:46,058
words DStream. Now similarly,
whatever thing you're doing,

3808
02:36:46,058 --> 02:36:48,000
you're going
to get accordingly

3809
02:36:48,000 --> 02:36:50,569
an output DStream
for it as well.

3810
02:36:50,569 --> 02:36:55,100
So this is what is happening
in this particular example now.
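The step just described, applying one operation to every batch of a lines stream to get a words stream, can be sketched in plain Python. This is an assumption-laden illustration rather than Spark code: a list of lists stands in for the DStream, and the names lines_dstream, words_dstream and flat_map_batch are made up for the example.

```python
# A stand-in for a DStream: one list of lines per one-second batch (made-up data).
lines_dstream = [
    ["hello spark", "hello streaming"],  # batch for time 0 to 1
    ["spark streaming"],                 # batch for time 1 to 2
    [],                                  # batch for time 2 to 3 (no data arrived)
]


def flat_map_batch(batch):
    """Split every line of one batch into words and flatten the result."""
    return [word for line in batch for word in line.split()]


# The same operation runs on every batch, producing a new "words" stream
# with exactly one output batch per input batch.
words_dstream = [flat_map_batch(batch) for batch in lines_dstream]
print(words_dstream)
```

The point of the sketch is the shape: the output has the same number of batches as the input, and each output batch depends only on its own input batch.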

3811
02:36:56,700 --> 02:36:59,700
flatMap. flatMap is an API.

3812
02:37:00,000 --> 02:37:02,100
It is very similar to map.

3813
02:37:02,100 --> 02:37:04,089
It kind of flattens
your values.

3814
02:37:04,089 --> 02:37:04,400
Okay.

3815
02:37:04,400 --> 02:37:06,400
So let me explain you
with an example.

3816
02:37:06,400 --> 02:37:07,300
What is flatMap?

3817
02:37:07,500 --> 02:37:10,100
So let's say
if I say that hi,

3818
02:37:10,400 --> 02:37:13,200
this is Edureka.

3819
02:37:14,500 --> 02:37:15,600
Welcome.

3820
02:37:16,200 --> 02:37:18,100
Okay, let's say this is the data.

3821
02:37:18,222 --> 02:37:18,723
Now.

3822
02:37:18,723 --> 02:37:20,800
I want to apply a flatMap.

3823
02:37:20,800 --> 02:37:22,900
So let's say this is
in the form of an RDD

3824
02:37:22,900 --> 02:37:24,600
also. Now on this RDD,

3825
02:37:24,600 --> 02:37:28,200
let's say I apply flatMap.
So let's say rdd, this is

3826
02:37:28,200 --> 02:37:30,000
the RDD, dot flatMap.

3827
02:37:31,600 --> 02:37:35,000
It's not map,
it's flatMap.

3828
02:37:35,100 --> 02:37:38,467
And then let's say you want
to define something for it.

3829
02:37:38,467 --> 02:37:40,400
So let's say you say that okay,

3830
02:37:41,100 --> 02:37:43,400
you are defining
a variable, say a.

3831
02:37:43,700 --> 02:37:48,300
So let's say a, then a dot. Now

3832
02:37:48,400 --> 02:37:53,300
after that you are defining
your dot split. Split.

3833
02:37:55,300 --> 02:37:58,417
We're splitting with respect
to space. Now in this case,

3834
02:37:58,417 --> 02:38:00,106
what is going to happen now?

3835
02:38:00,106 --> 02:38:03,966
I'm not writing the exact syntax here,
just explaining flatMap,

3836
02:38:03,966 --> 02:38:06,500
just to kind of give
you an idea about how it works.

3837
02:38:06,503 --> 02:38:09,196
It is going to flatten
up this file

3838
02:38:09,200 --> 02:38:11,200
with respect to the split

3839
02:38:11,200 --> 02:38:15,200
that you have mentioned here.
It means it is going to now

3840
02:38:15,200 --> 02:38:18,500
create each element as one word.

3841
02:38:18,684 --> 02:38:21,915
It is going to create
it like this: "Hi" as one

3842
02:38:22,200 --> 02:38:26,100
word, one element; "this"
as one word, one element;

3843
02:38:26,100 --> 02:38:27,515
"is" as another word,

3844
02:38:27,515 --> 02:38:30,939
one element; "Edureka" as
one word, one element;

3845
02:38:30,939 --> 02:38:33,200
"welcome" as one
word, for example.

3846
02:38:33,200 --> 02:38:33,841
So this is

3847
02:38:33,841 --> 02:38:37,558
how your flatMap works: it kind
of flattens up your whole file.

3848
02:38:37,558 --> 02:38:40,700
So this is what we are doing
in our stream as well.

3849
02:38:40,700 --> 02:38:43,400
So this is
how this will work.
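To make the difference between map and flatMap concrete, here is a small plain-Python sketch of the split-on-space example walked through above. It is not Spark code; the variable names and the two sample lines are assumptions chosen to mirror the spoken example.

```python
from itertools import chain

# Two made-up input lines, mirroring the spoken example.
lines = ["Hi this is Edureka", "Welcome"]

# map-style: one output element per input line, so the result stays nested.
mapped = [line.split(" ") for line in lines]

# flatMap-style: split each line, then flatten so every word is its own element.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

With map the result keeps one list per line; with the flatMap-style version every word becomes a separate element, which is the flattening described in the walkthrough.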

3850
02:38:44,100 --> 02:38:47,143
Now so we have just
understood this part.

3851
02:38:47,143 --> 02:38:51,100
Now, let's understand input
DStreams and receivers.

3852
02:38:51,100 --> 02:38:52,500
Okay, what are these things?

3853
02:38:52,500 --> 02:38:53,900
Let's understand this part.

3854
02:38:54,800 --> 02:38:55,200
Okay.

3855
02:38:55,200 --> 02:38:57,700
So what are the input
DStream sources possible?

3856
02:38:57,700 --> 02:39:00,900
They can be basic sources
or advanced sources. In basic sources

3857
02:39:00,900 --> 02:39:04,500
we can have file systems,
socket connections.

3858
02:39:04,600 --> 02:39:08,400
In advanced sources we
can have Kafka, Flume, Kinesis.

3859
02:39:08,800 --> 02:39:09,200
Okay.

3860
02:39:09,300 --> 02:39:10,800
So your input DStreams are

3861
02:39:10,800 --> 02:39:14,000
DStreams
representing the stream

3862
02:39:14,300 --> 02:39:19,200
of input data received
from streaming sources.

3863
02:39:19,400 --> 02:39:20,865
This is again the same thing.

3864
02:39:20,865 --> 02:39:21,136
Okay.

3865
02:39:21,136 --> 02:39:23,198
So there are
two types of sources,

3866
02:39:23,198 --> 02:39:24,500
which we just discussed.

3867
02:39:24,600 --> 02:39:27,676
One is your basic, and the second
is your advanced.

3868
02:39:28,400 --> 02:39:29,800
Let's move further.

3869
02:39:30,700 --> 02:39:33,700
Now what are we going
to see here?

3870
02:39:33,700 --> 02:39:35,870
So if you notice let's see here.

3871
02:39:35,870 --> 02:39:39,600
There are some events, and they
are going to your receiver

3872
02:39:39,600 --> 02:39:44,158
and then into a DStream. Now
RDDs are getting created

3873
02:39:44,158 --> 02:39:47,082
and we are performing
some steps on them.

3874
02:39:47,300 --> 02:39:52,300
So the receiver sends
the data into the DStream,

3875
02:39:52,500 --> 02:39:57,100
where each batch is going
to contain RDDs.

3876
02:39:57,200 --> 02:40:00,800
So this is what your
receiver

3877
02:40:00,800 --> 02:40:02,500
is doing here. Now,

3878
02:40:03,500 --> 02:40:07,200
moving further: transformations
on the DStream.

3879
02:40:07,200 --> 02:40:08,384
Let's understand that.

3880
02:40:08,384 --> 02:40:10,500
What are the
Transformations available?

3881
02:40:10,500 --> 02:40:13,000
There are multiple
Transformations, which are

3882
02:40:13,000 --> 02:40:14,700
possible. The most popular ones,

3883
02:40:14,700 --> 02:40:16,100
Let's talk about that.

3884
02:40:16,100 --> 02:40:20,700
We have map, flatMap, filter,
reduce, groupBy. So there

3885
02:40:20,700 --> 02:40:23,992
are multiple transformations
available here. Now,

3886
02:40:23,992 --> 02:40:27,500
It is like you are getting
your input data now you

3887
02:40:27,500 --> 02:40:30,400
will be applying any
of these operations.

3888
02:40:30,400 --> 02:40:33,700
Means any Transformations
that is going to happen.

3889
02:40:33,700 --> 02:40:37,700
And then a new DStream
is going to be created.

3890
02:40:37,700 --> 02:40:39,900
Okay, so that is
what's going to happen.

3891
02:40:39,900 --> 02:40:41,851
So let's explore it one by one.

3892
02:40:41,851 --> 02:40:43,344
So let's start with map. Now,

3893
02:40:43,344 --> 02:40:46,200
if I start with map,
what happens with map?

3894
02:40:46,200 --> 02:40:48,600
It is going to create
batches of data.

3895
02:40:48,600 --> 02:40:49,100
Okay.

3896
02:40:49,100 --> 02:40:51,386
So let's say it is going
to create a map value

3897
02:40:51,386 --> 02:40:52,200
of it like this.

3898
02:40:52,200 --> 02:40:55,600
So let's say x is giving the output y,
y is giving the output z,

3899
02:40:55,600 --> 02:40:57,600
z is giving
the output x, right?

3900
02:40:57,600 --> 02:41:00,700
So in this similar format,
this is going to get mapped.

3901
02:41:00,700 --> 02:41:02,887
That is, for whatever
you're performing,

3902
02:41:02,887 --> 02:41:05,394
It is just going to create
batches of input data,

3903
02:41:05,394 --> 02:41:06,700
which you can execute it.

3904
02:41:06,700 --> 02:41:10,800
So it returns a new DStream
by passing each element

3905
02:41:10,800 --> 02:41:13,946
of the source DStream
through a function,

3906
02:41:13,946 --> 02:41:15,600
which you have defined.

3907
02:41:16,300 --> 02:41:17,789
Let's discuss flatMap,

3908
02:41:17,789 --> 02:41:20,074
which we have just
discussed. It is going

3909
02:41:20,074 --> 02:41:21,565
to flatten up the things.

3910
02:41:21,565 --> 02:41:22,805
So in this case, also,

3911
02:41:22,805 --> 02:41:25,400
if you notice, we are just
kind of flattening it. It

3912
02:41:25,400 --> 02:41:27,169
is very similar to map,

3913
02:41:27,169 --> 02:41:31,100
but each input item
can be mapped to zero

3914
02:41:31,200 --> 02:41:34,200
or more output items here.

3915
02:41:34,200 --> 02:41:38,400
Okay, and it is going to return
a new DStream by passing

3916
02:41:38,400 --> 02:41:41,700
each source element
through a function.

3917
02:41:41,700 --> 02:41:44,600
So we have just seen an example
of flatMap anyway,

3918
02:41:44,600 --> 02:41:47,300
so with that example, hopefully
you can remember it. It will be easier

3919
02:41:47,300 --> 02:41:49,200
for you to kind of
see the difference

3920
02:41:49,200 --> 02:41:55,260
between map and flatMap.
Now, moving further: filter.

3921
02:41:55,360 --> 02:41:58,593
As the name states, you
can now filter out the values.

3922
02:41:58,593 --> 02:41:59,876
So let's say you have

3923
02:41:59,876 --> 02:42:03,701
huge data, and you
want to filter out some values.

3924
02:42:03,701 --> 02:42:06,900
You just want to kind of work
with some filtered data.

3925
02:42:06,900 --> 02:42:09,700
Maybe you want to remove
some part of it.

3926
02:42:09,700 --> 02:42:11,900
Maybe you are trying
to put some Logic on it.

3927
02:42:11,900 --> 02:42:15,800
Does this line contain
this word or not?

3928
02:42:16,100 --> 02:42:16,900
If it does,

3929
02:42:16,900 --> 02:42:20,169
in that case you stream only
the data matching that particular criterion.

3930
02:42:20,169 --> 02:42:21,691
So this is what we do here,

3931
02:42:21,691 --> 02:42:25,300
but definitely most of the time
the output is going to be smaller

3932
02:42:25,300 --> 02:42:31,000
in comparison to your input.
reduce: reduce is just

3933
02:42:31,000 --> 02:42:34,500
like, it's going to do a kind
of aggregation on the whole.

3934
02:42:34,500 --> 02:42:37,400
Let's say in the end you want
to sum up all the data

3935
02:42:37,400 --> 02:42:38,200
what you have

3936
02:42:38,200 --> 02:42:41,500
that is going to be done
with the help of reduce.

3937
02:42:42,100 --> 02:42:43,800
Now after that, groupBy.

3938
02:42:43,800 --> 02:42:48,600
groupBy is like, it's going
to combine all the common values;

3939
02:42:48,600 --> 02:42:50,600
that is what group
by is going to do.

3940
02:42:50,600 --> 02:42:53,112
So as you can see
in this example all the things

3941
02:42:53,112 --> 02:42:55,196
which are starting
with C got grouped,

3942
02:42:55,196 --> 02:42:56,786
all the things which are starting

3943
02:42:56,786 --> 02:42:59,300
with J got grouped,
all the names starting

3944
02:42:59,300 --> 02:43:00,761
with C got grouped.
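The four transformations just described (map, filter, reduce, groupBy) can be tried on a single batch in plain Python. This is a sketch using Python built-ins, not the Spark API; the numbers and names are invented for the example, and grouping by first letter is only a guess at what the slide shows.

```python
from functools import reduce
from itertools import groupby

# One made-up batch of numeric values.
batch = [3, 8, 5, 12]

# map: pass every element through a function.
mapped = list(map(lambda x: x * 2, batch))

# filter: keep only the elements that satisfy a condition.
filtered = list(filter(lambda x: x > 4, batch))

# reduce: aggregate the whole batch into a single value (here, a sum).
total = reduce(lambda a, b: a + b, batch)

# groupBy: collect elements that share a key (here, the first letter).
names = ["Carl", "Jack", "Cathy", "John"]


def first_letter(name):
    return name[0]


# itertools.groupby only groups adjacent items, so sort by the key first.
grouped = {
    key: list(group)
    for key, group in groupby(sorted(names, key=first_letter), key=first_letter)
}

print(mapped, filtered, total, grouped)
```

In this sketch filter returns fewer elements than it receives, while reduce collapses the batch to a single value, which matches the behaviour described above.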

3945
02:43:00,800 --> 02:43:01,600
Now.

3946
02:43:02,000 --> 02:43:03,300
So again, what is

3947
02:43:03,300 --> 02:43:07,500
this DStream window? Now, to give
you an example of this window:

3948
02:43:07,500 --> 02:43:10,108
Everybody must be
knowing Twitter, right?

3949
02:43:10,108 --> 02:43:12,000
So now what happens in Twitter?

3950
02:43:12,000 --> 02:43:13,700
Let me go to my paint.

3951
02:43:14,100 --> 02:43:16,100
So in this example,

3952
02:43:16,100 --> 02:43:19,853
let's understand how
this windowing operation works. So,

3953
02:43:19,853 --> 02:43:21,400
let's say in the initial

3954
02:43:21,400 --> 02:43:24,600
10 seconds, in the initial
10 seconds,

3955
02:43:24,600 --> 02:43:27,200
Let's say the tweets
are happening in this way.

3956
02:43:27,200 --> 02:43:32,200
Let's say #A,
#A, #A. Now,

3957
02:43:32,200 --> 02:43:35,773
which is the trending tweet?
Definitely #A, right? #A is

3958
02:43:35,773 --> 02:43:38,900
my trending tweet. Maybe
in the next 10 seconds,

3959
02:43:40,600 --> 02:43:46,500
in the next 10 seconds,
now again #A, #B,

3960
02:43:47,200 --> 02:43:48,400
#B is happening.

3961
02:43:48,400 --> 02:43:51,400
Which is the trending one?
#B will be trending here.

3962
02:43:51,400 --> 02:43:51,800
Now.

3963
02:43:51,800 --> 02:43:54,261
Let's say in another 10 seconds.

3964
02:43:54,900 --> 02:43:56,700
Now this time let's say

3965
02:43:56,700 --> 02:44:03,266
#B, #B. So actually
#B, #B is happening. Now,

3966
02:44:03,266 --> 02:44:05,266
which is trending? #B only.

3967
02:44:05,500 --> 02:44:07,776
But now I want to find out

3968
02:44:07,776 --> 02:44:10,546
which is the trending
one in the last 30 seconds.

3969
02:44:11,400 --> 02:44:15,100
#B, right? Because
if I combine, I can do it easily.

3970
02:44:15,400 --> 02:44:19,900
Now this is your windowing
operation example. It means you

3971
02:44:19,900 --> 02:44:23,300
are not only looking
at your current window,

3972
02:44:23,300 --> 02:44:24,800
but you're also looking

3973
02:44:24,800 --> 02:44:27,516
at your previous windows.
When I say current window,

3974
02:44:27,516 --> 02:44:30,008
I'm talking about, let's say,
a 10-second slot.

3975
02:44:30,008 --> 02:44:32,600
In this 10-second slot,
let's say you are doing

3976
02:44:32,600 --> 02:44:35,431
this operation on #B, #B,
#B, #B.

3977
02:44:35,431 --> 02:44:37,456
So this is the current
window. Now you are

3978
02:44:37,456 --> 02:44:40,282
not only computing with respect
to your current window,

3979
02:44:40,282 --> 02:44:42,800
But you are also considering
your previous window.

3980
02:44:42,800 --> 02:44:44,055
Now, let's say in this case.

3981
02:44:44,055 --> 02:44:44,681
If I ask you,

3982
02:44:44,681 --> 02:44:46,900
can you give me the output
of which is trending

3983
02:44:46,900 --> 02:44:48,361
in last 17 seconds?

3984
02:44:48,361 --> 02:44:50,900
Will you be able
to answer know why

3985
02:44:50,900 --> 02:44:54,900
Because you don't have partial
information for 7 seconds;

3986
02:44:54,900 --> 02:44:56,400
you have information

3987
02:44:56,400 --> 02:45:01,000
for 10, 20, 30, meaning
multiples of 10,

3988
02:45:01,200 --> 02:45:03,500
but not the intermediate ones.

3989
02:45:03,500 --> 02:45:04,711
So keep this in mind.

3990
02:45:04,711 --> 02:45:07,365
Okay, so you will be able
to perform the windowing

3991
02:45:07,365 --> 02:45:10,207
operation only with respect
to your window size.

3992
02:45:10,207 --> 02:45:11,900
It's not like you can create

3993
02:45:11,900 --> 02:45:15,085
any partial value and do
the window operation. Now,

3994
02:45:15,085 --> 02:45:16,800
let's get back to the slides.

3995
02:45:21,800 --> 02:45:23,203
Now it's a similar thing.

3996
02:45:23,203 --> 02:45:24,350
So now it is shown here

3997
02:45:24,350 --> 02:45:27,100
that we are not only considering
the current window,

3998
02:45:27,100 --> 02:45:30,200
but we are also considering
the previous window

3999
02:45:30,200 --> 02:45:31,604
Now, next, let's understand

4000
02:45:31,604 --> 02:45:35,300
the output operators, or
operations, of DStreams.

4001
02:45:35,700 --> 02:45:38,434
when we talk
about output operations.

4002
02:45:38,434 --> 02:45:41,400
The output operations
are going to allow

4003
02:45:41,400 --> 02:45:45,853
the DStream data to be pushed
out to your external systems.

4004
02:45:45,853 --> 02:45:47,700
If you notice here, it means

4005
02:45:47,700 --> 02:45:51,300
whatever processing
you have done with respect to

4006
02:45:51,300 --> 02:45:54,300
whatever data you have
here, now your output you

4007
02:45:54,300 --> 02:45:57,100
can store in multiple ways.
You can store it in your file system.

4008
02:45:57,100 --> 02:45:58,600
You can keep it in your database.

4009
02:45:58,600 --> 02:46:01,800
You can keep it even
in your external systems

4010
02:46:01,800 --> 02:46:04,200
so you can keep it
in multiple places.

4011
02:46:04,200 --> 02:46:06,400
So that is
what is being reflected here.

4012
02:46:07,500 --> 02:46:10,600
Now, so if I talk
about output operation,

4013
02:46:10,600 --> 02:46:11,653
these are the ones

4014
02:46:11,653 --> 02:46:15,495
which are supported. We can print
out the values. We can use save

4015
02:46:15,495 --> 02:46:17,700
as text file. When you save
as a text file,

4016
02:46:17,700 --> 02:46:19,500
it saves it into your HDFS.

4017
02:46:19,500 --> 02:46:21,736
If you want you can
also use it to save it

4018
02:46:21,736 --> 02:46:23,100
in the local file system.

4019
02:46:23,100 --> 02:46:25,174
You can save it as
an object file.

4020
02:46:25,174 --> 02:46:27,500
Also, you can save
it as a Hadoop file

4021
02:46:27,500 --> 02:46:30,800
or you can also apply
the foreachRDD function.

4022
02:46:31,200 --> 02:46:34,500
Now, what is the
foreachRDD function?

4023
02:46:34,500 --> 02:46:35,956
Let's see this example.

4024
02:46:35,956 --> 02:46:39,700
Generally we spend time on this
part in detail when we teach

4025
02:46:39,700 --> 02:46:41,600
you in the Edureka sessions,

4026
02:46:41,600 --> 02:46:43,927
but just to give
you an idea now.

4027
02:46:43,927 --> 02:46:46,310
This is a very
powerful primitive

4028
02:46:46,310 --> 02:46:49,608
that is going to allow
your data to be sent out

4029
02:46:49,608 --> 02:46:51,400
to your external systems.

4030
02:46:51,400 --> 02:46:53,700
So using this you
can send it across

4031
02:46:53,700 --> 02:46:55,500
to your web server system.

4032
02:46:55,500 --> 02:46:57,385
We have just seen the external systems

4033
02:46:57,385 --> 02:46:58,904
that we can use: a file system.

4034
02:46:58,904 --> 02:46:59,900
It can be anything.

4035
02:46:59,900 --> 02:47:02,800
So using this you
will be able to transfer it.

4036
02:47:02,800 --> 02:47:05,100
You will be
able to send it out

4037
02:47:05,100 --> 02:47:07,162
to your external systems.

4038
02:47:07,500 --> 02:47:11,500
Now, let's understand
caching and persistence. Now,

4039
02:47:11,500 --> 02:47:14,300
when we talk
about caching and persistence,

4040
02:47:14,300 --> 02:47:18,900
DStreams also allow
the developers to cache

4041
02:47:19,000 --> 02:47:22,100
or to persist the stream's data

4042
02:47:22,100 --> 02:47:27,023
in memory. It means you
can keep your data in memory.

4043
02:47:27,023 --> 02:47:31,100
You can cache your data
in memory for a longer time.

4044
02:47:31,200 --> 02:47:33,200
Even after your
action is complete.

4045
02:47:33,200 --> 02:47:36,000
It is not going to delete it

4046
02:47:36,100 --> 02:47:38,946
so you can just use
this as many times

4047
02:47:38,946 --> 02:47:39,800
as you want

4048
02:47:39,800 --> 02:47:42,900
so you can simply use
the persist() method to do that.

4049
02:47:42,900 --> 02:47:44,485
So for your input streams

4050
02:47:44,485 --> 02:47:48,100
which are receiving the data
over the network, maybe using

4051
02:47:48,100 --> 02:47:50,000
Kafka, Flume, sockets,

4052
02:47:50,400 --> 02:47:54,500
the default persistence level
is set to replicate

4053
02:47:54,500 --> 02:47:57,331
the data to two nodes
for fault tolerance,

4054
02:47:57,331 --> 02:48:00,500
like it is also going
to be replicating the data

4055
02:48:00,502 --> 02:48:01,600
into two parts

4056
02:48:01,600 --> 02:48:04,800
so you can see the same thing
in this diagram.

4057
02:48:05,300 --> 02:48:07,979
Let's understand
accumulators, broadcast

4058
02:48:07,979 --> 02:48:09,600
variables and checkpoints.

4059
02:48:09,700 --> 02:48:12,553
Now, these are mostly
for your performance.

4060
02:48:12,553 --> 02:48:16,626
So these are going to help you
to, kind of, help you

4061
02:48:16,626 --> 02:48:18,444
on the performance part.

4062
02:48:18,444 --> 02:48:20,600
So accumulators are nothing

4063
02:48:20,600 --> 02:48:25,200
but variables that are only
added through an associative

4064
02:48:25,300 --> 02:48:27,400
and commutative operation.

4065
02:48:28,000 --> 02:48:31,100
Usually, if you're coming
from a Hadoop background,

4066
02:48:31,100 --> 02:48:32,678
if you have done, let's say,

4067
02:48:32,678 --> 02:48:35,400
MapReduce programming, you
must have seen something

4068
02:48:35,400 --> 02:48:36,900
called counters.

4069
02:48:36,900 --> 02:48:38,749
They are used
as counters,

4070
02:48:38,749 --> 02:48:42,000
which kind of helps us to debug
the program as well and you

4071
02:48:42,000 --> 02:48:44,700
can perform some analysis
in the console itself.

4072
02:48:44,700 --> 02:48:46,600
Now a similar thing
you can do

4073
02:48:46,600 --> 02:48:48,100
with the accumulators as well.

4074
02:48:48,100 --> 02:48:50,152
So you can implement
your counters with the

4075
02:48:50,152 --> 02:48:52,800
help of this part. You can
also sum up the things

4076
02:48:52,800 --> 02:48:54,800
with this part. Now you can,

4077
02:48:54,800 --> 02:48:57,800
if you want to track
through the UI, you can also do

4078
02:48:57,800 --> 02:49:00,402
that as you can see
in this UI itself.

4079
02:49:00,402 --> 02:49:02,500
You can see all your accumulators

4080
02:49:02,500 --> 02:49:05,400
as well. Now, similarly,
we have broadcast

4081
02:49:05,400 --> 02:49:10,300
variables. Now, broadcast variables
allow the programmer to keep

4082
02:49:10,300 --> 02:49:14,787
a read-only variable cached
on all the machines

4083
02:49:14,787 --> 02:49:16,325
which are available.

4084
02:49:16,838 --> 02:49:19,838
Now it is going
to be kind of caching it

4085
02:49:19,838 --> 02:49:21,684
on all the machines. Now,

4086
02:49:22,000 --> 02:49:25,900
they can be used to give
every node a copy

4087
02:49:26,200 --> 02:49:29,000
of a large input data set

4088
02:49:29,300 --> 02:49:35,028
in an efficient manner, so you
can just use that. Spark

4089
02:49:35,028 --> 02:49:39,643
also attempts to distribute the
broadcast variables

4090
02:49:39,643 --> 02:49:41,700
using efficient broadcast

4091
02:49:41,700 --> 02:49:44,907
algorithms to reduce
the communication cost.

4092
02:49:44,907 --> 02:49:46,100
So as you can see here,

4093
02:49:46,100 --> 02:49:47,800
we are passing
this broadcast value

4094
02:49:47,800 --> 02:49:50,700
it is going to the SparkContext,
and then it is broadcasting

4095
02:49:50,700 --> 02:49:51,700
to these places.

4096
02:49:51,700 --> 02:49:55,500
So this is how it
is working in this application.

4097
02:49:55,700 --> 02:49:58,582
Generally, when we teach
in the classes, and also

4098
02:49:58,582 --> 02:50:00,600
since these are
advanced concepts,

4099
02:50:00,600 --> 02:50:02,953
we kind
of try to explain it to you

4100
02:50:02,953 --> 02:50:05,189
with the practicals.
But right now,

4101
02:50:05,189 --> 02:50:08,915
I just want to give you an idea
about what these things are.

4102
02:50:08,915 --> 02:50:09,764
So when you go

4103
02:50:09,764 --> 02:50:12,009
with the practicals
of all these things

4104
02:50:12,009 --> 02:50:13,367
how accumulators work,

4105
02:50:13,367 --> 02:50:16,700
how this is happening, how it
is getting broadcasted, things

4106
02:50:16,700 --> 02:50:19,941
become more and more clear
at that time. Right now,

4107
02:50:19,941 --> 02:50:20,683
I just want

4108
02:50:20,683 --> 02:50:24,600
that everybody at least gets a
high-level overview of things.

4109
02:50:25,246 --> 02:50:28,400
Now, moving further:
what are checkpoints?

4110
02:50:28,400 --> 02:50:30,257
So checkpoints are similar

4111
02:50:30,257 --> 02:50:32,900
to your checkpoints
in gaming. Now,

4112
02:50:32,900 --> 02:50:37,200
they make
it run 24/7, make it resilient

4113
02:50:37,200 --> 02:50:41,400
to failures unrelated
to the application logic.

4114
02:50:41,500 --> 02:50:43,214
So if you can see this diagram,

4115
02:50:43,214 --> 02:50:45,296
we are just
creating the checkpoint.

4116
02:50:45,296 --> 02:50:47,200
So in the
metadata checkpoint,

4117
02:50:47,200 --> 02:50:50,279
you can see it is the saving
of the information

4118
02:50:50,279 --> 02:50:53,827
which is defining the streaming
computation. If we talk

4119
02:50:53,827 --> 02:50:55,300
about data checkpointing,

4120
02:50:55,600 --> 02:51:01,000
it is the saving of the generated
RDDs to reliable storage.

4121
02:51:01,100 --> 02:51:03,400
So both of these
are generating

4122
02:51:03,400 --> 02:51:06,900
the checkpoint.
Now, moving forward,

4123
02:51:06,900 --> 02:51:09,815
we are going to move
towards our project,

4124
02:51:09,815 --> 02:51:14,300
where we are going to perform
our Twitter sentiment analysis.

4125
02:51:14,400 --> 02:51:17,413
Let's discuss a very
important use case

4126
02:51:17,413 --> 02:51:19,600
of Twitter sentiment analysis.

4127
02:51:19,600 --> 02:51:21,500
This is going to
be very interesting

4128
02:51:21,500 --> 02:51:24,600
because we will just
do a real-time

4129
02:51:24,900 --> 02:51:28,588
analysis on Twitter sentiment,
and there can be a

4130
02:51:28,588 --> 02:51:31,900
lot of possibilities
for this sentiment analysis.

4131
02:51:31,900 --> 02:51:33,631
But we will
be taking something

4132
02:51:33,631 --> 02:51:36,000
for the tutorial, and it's going
to be very interesting.

4133
02:51:36,100 --> 02:51:39,900
So generally, when we do
all this in the course,

4134
02:51:39,900 --> 02:51:41,070
it is more detailed

4135
02:51:41,070 --> 02:51:44,582
because right now, in a webinar,
definitely going in deep is

4136
02:51:44,582 --> 02:51:46,000
not very much possible,

4137
02:51:46,000 --> 02:51:48,600
but during the training
at Edureka,

4138
02:51:48,600 --> 02:51:51,470
you will learn all these things
in detail. Awesome,

4139
02:51:51,470 --> 02:51:52,994
right? That's something

4140
02:51:52,994 --> 02:51:55,100
which we learn
during the sessions.

4141
02:51:55,100 --> 02:51:59,061
Now, let's talk
about some use cases of Twitter.

4142
02:51:59,300 --> 02:52:01,300
As I said there can be
multiple use cases

4143
02:52:01,300 --> 02:52:02,300
which are possible

4144
02:52:02,300 --> 02:52:04,156
because there are solutions

4145
02:52:04,156 --> 02:52:07,100
behind whatever the continue
doing it so much

4146
02:52:07,100 --> 02:52:08,700
of social media right now

4147
02:52:08,700 --> 02:52:11,288
in these days are
very active has been right.

4148
02:52:11,288 --> 02:52:12,400
You must be noticing

4149
02:52:12,400 --> 02:52:15,300
that even politicians
have started using Twitter

4150
02:52:15,300 --> 02:52:18,000
and all
their tweets are being shown

4151
02:52:18,000 --> 02:52:21,200
in the news channel in cystic
of a heart-rending to it

4152
02:52:21,200 --> 02:52:23,900
because they are talking
about positive negative

4153
02:52:23,900 --> 02:52:26,100
in any politician
use Something right?

4154
02:52:26,100 --> 02:52:27,900
And if we talk
about anything is even

4155
02:52:27,900 --> 02:52:29,100
if we talk about, let's say,

4156
02:52:29,100 --> 02:52:32,260
any sport, the FIFA World Cup
is going on, then you will notice

4157
02:52:32,260 --> 02:52:35,200
Twitter will always be filled up
with a lot of tweets.

4158
02:52:35,200 --> 02:52:38,435
So how we can make use of it,
how we can do some analysis

4159
02:52:38,435 --> 02:52:41,400
on top of it, that is what we
are going to learn in this.

4160
02:52:41,400 --> 02:52:44,600
So there can be multiple sorts
of sentiment analysis.

4161
02:52:44,600 --> 02:52:47,595
It can be done for
your crisis management, service

4162
02:52:47,595 --> 02:52:50,900
adjusting, target marketing.
We can keep on talking about it.

4163
02:52:50,900 --> 02:52:52,716
When a new movie releases, now

4164
02:52:52,716 --> 02:52:55,200
even the moviemakers
kind of analyze:

4165
02:52:55,200 --> 02:52:57,628
okay, how is this movie
going to perform?

4166
02:52:57,628 --> 02:53:00,356
So they can easily make
it out beforehand:

4167
02:53:00,356 --> 02:53:04,200
Okay, this movie is going to go
in this kind of range of profit

4168
02:53:04,200 --> 02:53:05,800
or not interesting day.

4169
02:53:05,800 --> 02:53:08,200
I let us explore
not to Impossible even

4170
02:53:08,200 --> 02:53:10,500
in political campaigns.
In fact, you must have heard

4171
02:53:10,600 --> 02:53:11,400
that in the U.S.,

4172
02:53:11,400 --> 02:53:13,600
when the presidential
election was happening,

4173
02:53:13,600 --> 02:53:15,676
they used, in fact, the role

4174
02:53:15,676 --> 02:53:19,600
of social media, of all
this analysis, and

4175
02:53:19,600 --> 02:53:22,400
that played
a major role in winning

4176
02:53:22,400 --> 02:53:23,880
that election. Similarly,

4177
02:53:23,880 --> 02:53:26,100
investors
want to predict

4178
02:53:26,100 --> 02:53:28,950
whether they should invest
in a particular company or not,

4179
02:53:28,950 --> 02:53:30,300
whether they want to check

4180
02:53:30,300 --> 02:53:33,715
which customers we
should target

4181
02:53:33,715 --> 02:53:34,900
for advertisement

4182
02:53:34,900 --> 02:53:38,000
because we cannot target
everyone. The problem with targeting

4183
02:53:38,000 --> 02:53:40,580
everyone is that if we try
to target all of them,

4184
02:53:40,580 --> 02:53:43,032
it will be very costly.
So we want to kind

4185
02:53:43,032 --> 02:53:44,333
of filter it a little bit,

4186
02:53:44,333 --> 02:53:46,178
because maybe my set
of people whom I

4187
02:53:46,178 --> 02:53:48,954
should send this advertisement
to will be more effective,

4188
02:53:48,954 --> 02:53:52,000
and it
is going to be cost-effective

4189
02:53:52,000 --> 02:53:54,100
as well. If you want
to do it for products

4190
02:53:54,100 --> 02:53:57,200
and services also,
I guess we can also do this.

4191
02:53:57,200 --> 02:53:57,500
Now.

4192
02:53:57,500 --> 02:54:00,900
Let's see some use cases.
In terms of the use case,

4193
02:54:00,900 --> 02:54:03,100
I will show you practically
how it runs.

4194
02:54:03,100 --> 02:54:04,000
So first of all,

4195
02:54:04,000 --> 02:54:06,724
we will be importing all
the required packages

4196
02:54:06,724 --> 02:54:08,725
because we are going
to now perform

4197
02:54:08,725 --> 02:54:10,400
our Twitter sentiment analysis.

4198
02:54:10,400 --> 02:54:12,824
So we will be requiring
some packages for that.

4199
02:54:12,824 --> 02:54:15,700
So we will be doing that as
a first step. Then we need

4200
02:54:15,700 --> 02:54:18,641
to set all our authentication.
Without authentication,

4201
02:54:18,641 --> 02:54:21,405
we cannot do anything.
Now here the challenge is

4202
02:54:21,405 --> 02:54:23,201
we cannot directly
put your username

4203
02:54:23,201 --> 02:54:24,431
and password. Don't you think

4204
02:54:24,431 --> 02:54:27,100
it will get compromised if you put
your username and password?

4205
02:54:27,200 --> 02:54:28,800
So Twitter came up with something,

4206
02:54:28,800 --> 02:54:30,400
a very smart thing.

4207
02:54:30,500 --> 02:54:33,100
What they did is they came
up with something

4208
02:54:33,100 --> 02:54:35,080
called authentication tokens.

4209
02:54:35,080 --> 02:54:37,100
So you have to go
to dev.twitter.com,

4210
02:54:37,100 --> 02:54:39,100
log in from there,

4211
02:54:39,100 --> 02:54:42,972
and you will find kind of all
these authentication tokens

4212
02:54:42,972 --> 02:54:44,100
available to you

4213
02:54:44,100 --> 02:54:47,900
Four will be required. Take
those and put them here. Then,

4214
02:54:47,900 --> 02:54:50,335
as we have learned
the DStream transformations,

4215
02:54:50,335 --> 02:54:52,294
you will be doing
all that computation

4216
02:54:52,294 --> 02:54:55,100
so you will be having
your DStream transformation.

4217
02:54:55,100 --> 02:54:58,100
Then you will be
generating your Tweet data.

4218
02:54:58,100 --> 02:55:01,472
You are going to save it
in this particular directory.

4219
02:55:01,472 --> 02:55:03,400
Once you are done with this.

4220
02:55:03,400 --> 02:55:06,200
Then you are going
to extract your sentiment

4221
02:55:06,200 --> 02:55:07,600
once you extract it.

4222
02:55:07,600 --> 02:55:08,400
And you're done.

4223
02:55:08,400 --> 02:55:11,900
Let me show you quickly
how it works in our VM.

4224
02:55:12,000 --> 02:55:15,226
Now one more interesting thing
about Edureka would be

4225
02:55:15,226 --> 02:55:18,247
that you will be getting all
these preconfigured machines.

4226
02:55:18,247 --> 02:55:19,482
So you need not worry

4227
02:55:19,482 --> 02:55:21,892
about from where I
will be getting all this.

4228
02:55:21,892 --> 02:55:25,100
Is it like very difficult
to install when I was waiting.

4229
02:55:25,100 --> 02:55:26,400
This open source location.

4230
02:55:26,400 --> 02:55:29,061
It was not working for me
in my operating system.

4231
02:55:29,061 --> 02:55:30,179
It was not working.

4232
02:55:30,179 --> 02:55:32,400
So many things we
have generally seen

4233
02:55:32,400 --> 02:55:34,700
people face issues. To resolve

4234
02:55:34,700 --> 02:55:36,600
everything, we kind

4235
02:55:36,600 --> 02:55:40,000
of provide all this fear
question from Rockville.

4236
02:55:40,000 --> 02:55:41,900
This pm has priest but yes,

4237
02:55:41,900 --> 02:55:44,300
that's what it has
everything pre-installed.

4238
02:55:44,300 --> 02:55:46,700
Whichever will be required
for your training.

4239
02:55:46,700 --> 02:55:49,133
So that's the best part
what we also provide.

4240
02:55:49,133 --> 02:55:51,700
So in this case, your Eclipse
will already be there.

4241
02:55:51,700 --> 02:55:53,900
You need to just go
to your Eclipse location.

4242
02:55:53,900 --> 02:55:55,300
Let me show you how you can.

4243
02:55:55,300 --> 02:55:56,700
So cold that if you want

4244
02:55:57,200 --> 02:56:00,600
You just need to go inside it

4245
02:56:00,600 --> 02:56:02,200
and double-click on it; that's it.

4246
02:56:02,200 --> 02:56:04,400
You need not go and kind
of install Eclipse,

4247
02:56:04,400 --> 02:56:07,400
and even Spark will
already be installed for you.

4248
02:56:07,400 --> 02:56:09,900
Let us go in our project.

4249
02:56:09,900 --> 02:56:12,895
So this is our project
which is in front of you.

4250
02:56:12,895 --> 02:56:15,674
This is my project,
which we are going to work on.

4251
02:56:15,674 --> 02:56:16,653
Now you can see

4252
02:56:16,653 --> 02:56:19,522
that we have first
imported all the libraries

4253
02:56:19,522 --> 02:56:22,146
then we have set
our authentication system,

4254
02:56:22,146 --> 02:56:24,806
and then we have moved on
and kind of executed

4255
02:56:24,806 --> 02:56:27,900
the DStream transformation,
extracted the tweets,

4256
02:56:27,900 --> 02:56:29,900
and then saved
the output file.

4257
02:56:29,900 --> 02:56:32,100
So these are the things
which we have done

4258
02:56:32,100 --> 02:56:36,000
in this program has now let's
execute it to run this program.

4259
02:56:36,000 --> 02:56:39,900
It's very simple go
to run as and from run

4260
02:56:39,900 --> 02:56:42,700
as click on still application.

4261
02:56:43,200 --> 02:56:45,276
You will notice in the end.

4262
02:56:45,276 --> 02:56:48,600
It is releasing
that great good to see that

4263
02:56:48,886 --> 02:56:51,286
so it is executing the program.

4264
02:56:51,286 --> 02:56:52,440
Let us execute.

4265
02:56:55,700 --> 02:56:57,800
It is bringing the tweets for Trump.

4266
02:56:57,800 --> 02:57:01,292
So use these for Trump any way
that we surveyed to be negative.

4267
02:57:01,292 --> 02:57:01,629
Right?

4268
02:57:01,629 --> 02:57:02,654
It's an achievement

4269
02:57:02,654 --> 02:57:06,036
because anything you do for Trump
will be negative. Trump is

4270
02:57:06,036 --> 02:57:07,563
anyway the hot topic for us.

4271
02:57:07,563 --> 02:57:09,200
Maybe make it a little bigger.

4272
02:57:14,100 --> 02:57:17,200
You will notice a lot
of negative tweets coming up.

4273
02:57:24,700 --> 02:57:26,900
Yes, now, I'm just stopping it

4274
02:57:26,900 --> 02:57:28,742
so that I can
show you something.

4275
02:57:28,742 --> 02:57:28,972
Yes.

4276
02:57:28,972 --> 02:57:30,700
It's filtering the tweets.

4277
02:57:30,800 --> 02:57:33,700
So we have actually written
that in the program itself.

4278
02:57:33,700 --> 02:57:36,300
You have given
at one location from using

4279
02:57:36,300 --> 02:57:38,087
that we were kind of asking

4280
02:57:38,087 --> 02:57:41,200
for the tweets on Trump. Now
here we are doing analysis,

4281
02:57:41,200 --> 02:57:43,064
and it is also going to tell us

4282
02:57:43,064 --> 02:57:46,264
whether it's a positive tweet or a
negative tweet. In this situation,

4283
02:57:46,264 --> 02:57:47,500
It is giving up Faith

4284
02:57:47,500 --> 02:57:50,444
because term for Transit even
will not quit positive rate.

4285
02:57:50,444 --> 02:57:51,454
So that's something

4286
02:57:51,454 --> 02:57:53,790
so that's
the reason you're finding

4287
02:57:53,790 --> 02:57:54,800
this is negative.

4288
02:57:54,900 --> 02:57:56,412
Similarly if there
will be any other

4289
02:57:56,412 --> 02:57:57,964
that we should
be getting a static.

4290
02:57:57,964 --> 02:58:00,200
So right now if I keep on
moving ahead we will see

4291
02:58:00,200 --> 02:58:02,300
multiple negative tweets
which will come up.

4292
02:58:02,300 --> 02:58:04,600
So that's how this program runs.

4293
02:58:04,900 --> 02:58:07,000
So this is how our program

4294
02:58:07,000 --> 02:58:09,403
we will be executing
we can distract it.

4295
02:58:09,403 --> 02:58:13,100
Even the output results will be
getting stored at a location,

4296
02:58:13,100 --> 02:58:16,500
as you can see in this
if I go to my location here,

4297
02:58:16,500 --> 02:58:19,100
this is my actual project
where it is running

4298
02:58:19,100 --> 02:58:20,533
so you can just come

4299
02:58:20,533 --> 02:58:23,400
to this location. Here
is all your output.

4300
02:58:23,400 --> 02:58:24,982
All your output
is getting stored there,

4301
02:58:24,982 --> 02:58:26,200
so you can just take a look as well.

4302
02:58:26,200 --> 02:58:28,200
But yes, so
everything is done

4303
02:58:28,200 --> 02:58:29,971
by using Spark Streaming.

4304
02:58:29,971 --> 02:58:30,300
Okay.

4305
02:58:30,300 --> 02:58:31,900
That's what we've
seen right reverse

4306
02:58:31,900 --> 02:58:33,653
that we were seeing
it with respect

4307
02:58:33,653 --> 02:58:35,200
to the DStream transformations.

4308
02:58:35,200 --> 02:58:38,300
So we have done all that
with the help of Spark Streaming only.

4309
02:58:38,400 --> 02:58:41,200
So that is one
of the awesome parts about this:

4310
02:58:41,200 --> 02:58:44,700
that you can do such
powerful things with respect

4311
02:58:44,700 --> 02:58:47,279
to your with respect
to you this way.

4312
02:58:47,279 --> 02:58:49,500
Now, let's analyze the results.

4313
02:58:49,800 --> 02:58:51,152
So as we have just seen

4314
02:58:51,152 --> 02:58:53,400
that it is showing
whether it's a positive

4315
02:58:53,400 --> 02:58:54,800
or a negative tweet.

4316
02:58:55,000 --> 02:58:57,200
So this is where your output
is getting stored.

4317
02:58:57,200 --> 02:59:00,000
As I showed you, the output
will appear like this.

4318
02:59:00,000 --> 02:59:00,300
Okay.

4319
02:59:00,300 --> 02:59:02,700
This is just
your output. It explicitly

4320
02:59:02,700 --> 02:59:03,762
will also tell

4321
02:59:03,762 --> 02:59:05,848
whether it's a neutral
one, a positive one,

4322
02:59:05,848 --> 02:59:07,277
a negative one, everything.

4323
02:59:07,277 --> 02:59:09,600
We have done it
with the help of Spark

4324
02:59:09,600 --> 02:59:12,000
Streaming only. Now, we
have done it for Trump,

4325
02:59:12,000 --> 02:59:14,000
as I just explained you
that we have put

4326
02:59:14,000 --> 02:59:15,555
in our program itself from

4327
02:59:15,555 --> 02:59:17,589
like we have put
everything up here

4328
02:59:17,589 --> 02:59:21,000
and based on that only we
are getting all the output. Now

4329
02:59:21,000 --> 02:59:23,498
we can apply all
the sentiment analysis

4330
02:59:23,498 --> 02:59:24,403
and like this.

4331
02:59:24,403 --> 02:59:25,731
Like we have learned.

4332
02:59:25,731 --> 02:59:28,754
So I hope you have found
all this, especially

4333
02:59:28,754 --> 02:59:30,593
this use case very much useful

4334
02:59:30,593 --> 02:59:32,800
for you kind of
getting you that yes,

4335
02:59:32,800 --> 02:59:34,388
it is getting done by half.

4336
02:59:34,388 --> 02:59:36,200
But right now we
have put Trump here,

4337
02:59:36,200 --> 02:59:38,550
but if you want you can keep
on putting the hashtag as

4338
02:59:38,550 --> 02:59:40,286
well because that's
how we are doing it.

4339
02:59:40,286 --> 02:59:41,886
You can keep on
changing the tags.

4340
02:59:41,886 --> 02:59:44,335
Maybe you can kind of go
for, let's say, the FIFA World

4341
02:59:44,335 --> 02:59:45,200
Cup is going

4342
02:59:45,200 --> 02:59:49,000
on, or a cricket match is going
on; we can just pull the tweets

4343
02:59:49,000 --> 02:59:52,300
according to that.
In that case, instead of Trump,

4344
02:59:52,300 --> 02:59:53,980
you can put any player name

4345
02:59:53,980 --> 02:59:56,432
or maybe a team name,
and you will see all

4346
02:59:56,432 --> 02:59:58,300
that friendly becoming a father.

4347
02:59:58,300 --> 03:00:00,700
Okay, so that's
how you can play with this.

4348
03:00:01,000 --> 03:00:01,500
Now.

4349
03:00:01,800 --> 03:00:04,400
There are
multiple examples with it,

4350
03:00:04,400 --> 03:00:05,400
which we can play

4351
03:00:05,500 --> 03:00:09,500
and this use case can even be
evolved into multiple other types

4352
03:00:09,500 --> 03:00:10,250
of use cases.

4353
03:00:10,250 --> 03:00:12,200
You can just keep
on transforming it

4354
03:00:12,200 --> 03:00:14,300
according to your own use cases.

4355
03:00:14,400 --> 03:00:17,800
So that's it about Spark Streaming,
which I wanted to discuss.

4356
03:00:17,800 --> 03:00:21,000
So I hope you must
have found it useful.

4357
03:00:26,000 --> 03:00:28,228
So in classification generally

4358
03:00:28,228 --> 03:00:31,200
what happens just
to give you an example.

4359
03:00:31,300 --> 03:00:33,867
You must have noticed
the spam email box.

4360
03:00:33,867 --> 03:00:36,500
I hope everybody
must have seen

4361
03:00:36,500 --> 03:00:39,700
that spam folder in your
email box, in Gmail.

4362
03:00:39,800 --> 03:00:45,000
Now, when any new email comes up,
how does Google decide

4363
03:00:45,165 --> 03:00:49,134
whether it's a spam email
or a non-spam email?

4364
03:00:49,300 --> 03:00:53,400
That is done as an example
of classification. Clustering:

4365
03:00:53,576 --> 03:00:56,423
let's say you go
to Google News;

4366
03:00:56,500 --> 03:00:58,794
when you type
something, it groups

4367
03:00:58,794 --> 03:01:00,300
all the news together;

4368
03:01:00,300 --> 03:01:04,700
that is called your clustering.
Regression: regression is also one

4369
03:01:04,700 --> 03:01:07,300
of the very important
parts. It is shown here.

4370
03:01:07,500 --> 03:01:11,700
The regression is let's say
you have a house

4371
03:01:11,900 --> 03:01:14,100
and you want to sell that house

4372
03:01:14,400 --> 03:01:16,500
and you have no idea.

4373
03:01:16,700 --> 03:01:18,715
What is the optimal price?

4374
03:01:18,715 --> 03:01:21,100
You should keep for your house.

4375
03:01:21,100 --> 03:01:24,400
Now this regression
will help you

4376
03:01:24,400 --> 03:01:28,534
to achieve that. Collaborative
filtering: you might have seen,

4377
03:01:28,534 --> 03:01:31,000
when you go
to your Amazon web page

4378
03:01:31,000 --> 03:01:33,400
that they show you
a recommendation, right?

4379
03:01:33,400 --> 03:01:34,430
You can buy this

4380
03:01:34,430 --> 03:01:38,400
because you are buying this.
This is done with the help

4381
03:01:38,400 --> 03:01:40,900
of collaborative filtering.

4382
03:01:42,028 --> 03:01:44,315
Before I move to the project,

4383
03:01:44,315 --> 03:01:47,700
I want to show you
some practicals on how we

4384
03:01:47,700 --> 03:01:50,300
will be executing Spark things.

4385
03:01:50,503 --> 03:01:53,196
So let me take you
to the VM machine

4386
03:01:53,300 --> 03:01:55,300
which will be provided
by Edureka.

4387
03:01:55,300 --> 03:01:57,928
So these machines are also
provided by Edureka.

4388
03:01:57,928 --> 03:02:00,222
So you need not worry
about where you

4389
03:02:00,222 --> 03:02:01,963
will get the software

4390
03:02:01,963 --> 03:02:04,421
or how you will
install it there.

4391
03:02:04,421 --> 03:02:07,300
Everything is taken care of
by Edureka. Now,

4392
03:02:07,300 --> 03:02:08,957
once you come

4393
03:02:08,957 --> 03:02:12,059
to this, you will see
a machine like this.

4394
03:02:12,059 --> 03:02:13,300
Let me close this.

4395
03:02:13,300 --> 03:02:16,970
So you will see
a blank machine like this.

4396
03:02:16,970 --> 03:02:18,300
Let me show you this.

4397
03:02:18,300 --> 03:02:20,500
So this is what your machine
will look like.

4398
03:02:20,500 --> 03:02:24,100
Now, what are you going to do
in order to start working?

4399
03:02:24,100 --> 03:02:26,600
You will open
this terminal by clicking

4400
03:02:26,600 --> 03:02:27,800
on this black option.

4401
03:02:28,000 --> 03:02:29,300
Now after that,

4402
03:02:29,400 --> 03:02:34,400
what you can do is
go to Spark. Now,

4403
03:02:34,400 --> 03:02:39,300
how can I work with Spark?
In order to execute any program

4404
03:02:39,300 --> 03:02:43,000
in Spark using
a Scala program, you

4405
03:02:43,000 --> 03:02:46,700
will enter spark-

4406
03:02:46,700 --> 03:02:49,400
shell. If you type spark-shell,

4407
03:02:49,500 --> 03:02:52,500
it will take you
to the Scala prompt,

4408
03:02:52,800 --> 03:02:55,800
where you can write
your Spark program,

4409
03:02:56,100 --> 03:03:00,020
but using the Scala
programming language.

4410
03:03:00,020 --> 03:03:01,558
You can notice this.

4411
03:03:02,200 --> 03:03:06,300
Now, can you see that it
is also showing version 1.5.2?

4412
03:03:06,300 --> 03:03:09,200
So that is the version
of your Spark.

4413
03:03:09,800 --> 03:03:11,400
Now you can see here.

4414
03:03:11,400 --> 03:03:15,200
You can also see "Spark
context available as sc".

4415
03:03:15,200 --> 03:03:17,752
When you get connected
to your spark-shell,

4416
03:03:17,752 --> 03:03:21,441
this will be
available to you by default.

4417
03:03:21,441 --> 03:03:22,800
Let us get connected.

4418
03:03:22,800 --> 03:03:23,800
It takes some time.

4419
03:03:39,207 --> 03:03:40,746
Now we got it.

4420
03:03:40,746 --> 03:03:43,900
So we got connected
to the Scala prompt. Now,

4421
03:03:43,900 --> 03:03:45,894
if I want to come out of it,

4422
03:03:45,894 --> 03:03:49,300
I will just type exit;
it will let me come

4423
03:03:49,300 --> 03:03:51,400
out of this prompt.

4424
03:03:52,100 --> 03:03:56,176
Secondly, I can also write
my programs in Python.

4425
03:03:56,176 --> 03:03:57,407
So what I can do

4426
03:03:57,500 --> 03:04:00,200
if I want to do
programming in Spark,

4427
03:04:00,200 --> 03:04:03,040
but with the
Python programming language,

4428
03:04:03,040 --> 03:04:05,300
I will be connecting
with pyspark.

4429
03:04:05,300 --> 03:04:09,148
So I just need to type pyspark
in order to get connected

4430
03:04:09,148 --> 03:04:09,912
to Python.

4431
03:04:09,912 --> 03:04:10,206
Okay.

4432
03:04:10,206 --> 03:04:11,791
I'm not getting connected now

4433
03:04:11,791 --> 03:04:13,576
because I'm not
going to need it.

4434
03:04:13,576 --> 03:04:16,700
I will be explaining
everything in Scala.

4435
03:04:16,700 --> 03:04:19,700
But if you want to get connected,
you can type pyspark.

4436
03:04:19,700 --> 03:04:21,100
So let's again get connected

4437
03:04:21,100 --> 03:04:23,900
to my spark-
shell. Now, while

4438
03:04:23,900 --> 03:04:25,800
this is getting connected,

4439
03:04:25,800 --> 03:04:27,800
let us create a small file.

4440
03:04:27,800 --> 03:04:29,823
So let us create
a file. Currently,

4441
03:04:29,823 --> 03:04:31,897
if you notice I
don't have any file.

4442
03:04:31,897 --> 03:04:32,281
Okay.

4443
03:04:32,284 --> 03:04:34,300
I already have a.txt.

4444
03:04:34,300 --> 03:04:37,300
So let's say cat a.txt.

4445
03:04:37,400 --> 03:04:38,958
So I have some data: one,

4446
03:04:38,958 --> 03:04:40,200
two, three, four, five.

4447
03:04:40,200 --> 03:04:42,362
This is my data,
which is with me.

4448
03:04:42,362 --> 03:04:44,000
Now what I'm going to do,

4449
03:04:44,000 --> 03:04:47,900
let me push this file
to HDFS. Let's first check

4450
03:04:47,900 --> 03:04:49,900
if it is already available

4451
03:04:49,900 --> 03:04:55,000
in my system, that is, my
HDFS, with a hadoop dfs

4452
03:04:55,000 --> 03:04:57,900
command on a.txt, just
to quickly check

4453
03:04:57,900 --> 03:04:59,700
if it is already available.

4454
03:05:06,100 --> 03:05:09,400
There is no such file, so let
me first put this file

4455
03:05:09,400 --> 03:05:12,700
into my system: -put a.txt.

4456
03:05:14,200 --> 03:05:16,300
So this will put it
in the default location

4457
03:05:16,300 --> 03:05:17,200
of HDFS.

4458
03:05:17,200 --> 03:05:19,700
Now if I want to read it,
I can see it like this.

4459
03:05:19,700 --> 03:05:20,922
So again, I'm assuming

4460
03:05:20,922 --> 03:05:23,700
that you're aware of these
HDFS commands. So you

4461
03:05:23,700 --> 03:05:25,300
can now see this one, two,

4462
03:05:25,300 --> 03:05:28,500
three, four, five is coming
from the Hadoop file system.

4463
03:05:28,500 --> 03:05:30,192
Now what I want to do,

4464
03:05:30,192 --> 03:05:36,400
I want to use this file
in my Spark system. Now,

4465
03:05:36,400 --> 03:05:39,200
how can I do that? Let
me come here.

4466
03:05:39,200 --> 03:05:42,500
So in Scala,

4467
03:05:42,500 --> 03:05:46,000
we do not need to write int, float
and so on, unlike in Java,

4468
03:05:46,000 --> 03:05:48,700
where we define variables
like this, right: int

4469
03:05:48,700 --> 03:05:49,907
k = 10.

4470
03:05:49,907 --> 03:05:53,000
That is how we
define them. But in Scala,

4471
03:05:53,000 --> 03:05:55,400
we do not write the data type.

4472
03:05:55,473 --> 03:05:58,626
Instead, what we do
is declare it with var.

4473
03:05:58,700 --> 03:06:02,000
So if I write
var a = 10,

4474
03:06:02,100 --> 03:06:04,700
it will automatically identify

4475
03:06:04,900 --> 03:06:08,100
that it is
an integer value. Notice:

4476
03:06:08,900 --> 03:06:13,303
it tells me that
a is of type Int. Now,

4477
03:06:13,303 --> 03:06:16,072
if I want to update
this value to 20,

4478
03:06:16,072 --> 03:06:17,149
I can do that.

4479
03:06:17,400 --> 03:06:17,800
Now.

4480
03:06:17,900 --> 03:06:20,900
Let's say I want to update
this to "ABC", like this.

4481
03:06:21,200 --> 03:06:23,700
This will throw an error. Why?

4482
03:06:23,900 --> 03:06:27,400
Because a is already
defined as an integer

4483
03:06:27,600 --> 03:06:31,300
and you're trying to assign
the string "ABC" to it.

4484
03:06:31,300 --> 03:06:34,000
So that is the reason
you got this error.

4485
03:06:34,000 --> 03:06:34,900
Similarly,

4486
03:06:35,000 --> 03:06:38,000
there is one more thing
called val.

4487
03:06:38,300 --> 03:06:40,300
val b = 10.

4488
03:06:40,300 --> 03:06:44,200
Let's say I do this; it works
exactly like var.

4489
03:06:44,200 --> 03:06:47,500
But there is one difference.
Now, in this case,

4490
03:06:47,500 --> 03:06:51,600
if I do b = 20,
you will see an error.

4491
03:06:51,800 --> 03:06:57,000
And why this error? Because when
you define something as val,

4492
03:06:57,200 --> 03:06:59,200
it is a constant.

4493
03:06:59,300 --> 03:07:02,400
It is not going
to be variable anymore.

4494
03:07:02,430 --> 03:07:04,046
It will be a constant

4495
03:07:04,046 --> 03:07:08,300
and that is the reason,
if you define something as val,

4496
03:07:08,300 --> 03:07:10,700
it will not be updatable.

4497
03:07:10,700 --> 03:07:14,400
You will not be able
to update that value.
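
The var and val behaviour described above can be sketched in plain Scala. This is a minimal illustration, not code from the session: the object name VarValDemo and the method run are made up for the example, and the two lines that would fail are left as comments because they are compile-time errors.

```scala
// A minimal sketch of var versus val; VarValDemo and run are hypothetical names.
object VarValDemo {
  def run(): (Int, Int) = {
    // var declares a reassignable variable; the type Int is inferred from the literal 10.
    var a = 10
    a = 20          // allowed: a is a var and 20 is still an Int
    // a = "ABC"    // does not compile: type mismatch, a String cannot be assigned to an Int

    // val declares an immutable binding, which the session calls a constant.
    val b = 10
    // b = 20       // does not compile: reassignment to val

    (a, b)
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

The same statements can be typed one by one at the spark-shell or Scala prompt, where the error messages for the commented lines show up immediately.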

4498
03:07:14,400 --> 03:07:19,400
So this is how, in Scala, you
will write your programs:

4499
03:07:19,700 --> 03:07:23,969
var for your variables
and val for your constants.

4500
03:07:23,969 --> 03:07:27,200
So you will be
doing it like this. Now,

4501
03:07:27,200 --> 03:07:31,664
let's use it for the example
we have learned.

4502
03:07:31,664 --> 03:07:34,971
Let's say I want
to create an RDD.

4503
03:07:35,100 --> 03:07:40,100
So val number =
sc.textFile.

4504
03:07:40,100 --> 03:07:43,000
Remember this API? We
have learned this API

4505
03:07:43,000 --> 03:07:45,500
already: sc.textFile. Now,

4506
03:07:45,500 --> 03:07:49,300
let me give it this file, a.txt.

4507
03:07:49,500 --> 03:07:52,000
If I give it the file a.txt,

4508
03:07:52,300 --> 03:07:55,900
it will create
an RDD from this file.

4509
03:07:55,900 --> 03:07:57,000
It is telling

4510
03:07:57,000 --> 03:08:00,800
that I created an RDD
of string type.

4511
03:08:01,100 --> 03:08:01,300
Now.

4512
03:08:01,300 --> 03:08:06,600
If I want to read this data,
I will call number.collect.

4513
03:08:06,800 --> 03:08:10,415
This will print the values
that were available.

4514
03:08:10,415 --> 03:08:14,261
Can you see? Now, this line
that you are seeing here

4515
03:08:14,300 --> 03:08:17,300
is coming from memory.

4516
03:08:17,400 --> 03:08:19,382
This is from memory.

4517
03:08:19,382 --> 03:08:23,500
It is reading a.txt, and that is
the reason it is showing up

4518
03:08:23,500 --> 03:08:25,800
in this particular manner.

4519
03:08:25,842 --> 03:08:29,457
So this is how you
will be performing your step.

4520
03:08:29,484 --> 03:08:30,715
Now, the second thing.

4521
03:08:31,100 --> 03:08:36,000
I told you that Spark can work
on standalone systems as well,

4522
03:08:36,100 --> 03:08:36,400
Right?

4523
03:08:36,400 --> 03:08:38,400
So right now
what was happening was

4524
03:08:38,400 --> 03:08:42,000
that we executed this part
on our HDFS. Now,

4525
03:08:42,000 --> 03:08:46,283
if I want to execute this
on our local file system,

4526
03:08:46,283 --> 03:08:47,338
can I do that?

4527
03:08:47,338 --> 03:08:49,300
Yes, you can still do that.

4528
03:08:49,300 --> 03:08:51,300
What do you need to do for that?

4529
03:08:51,300 --> 03:08:54,700
So in that case,
the difference will come here.

4530
03:08:54,700 --> 03:08:57,000
Now, the file you are giving

4531
03:08:57,000 --> 03:08:59,748
here, instead
of giving it like that,

4532
03:08:59,748 --> 03:09:03,100
you will put
this file: keyword before it,

4533
03:09:03,100 --> 03:09:06,300
and after that you need
to give your local file path.

4534
03:09:06,300 --> 03:09:09,200
For example, what is
this path? /home/

4535
03:09:09,200 --> 03:09:09,900
edureka.

4536
03:09:09,900 --> 03:09:12,400
This is a local path,
not an HDFS path.

4537
03:09:12,400 --> 03:09:14,400
So you will be
writing /home

4538
03:09:14,400 --> 03:09:17,400
/edureka/a.txt.

4539
03:09:17,500 --> 03:09:19,100
Now if you give this

4540
03:09:19,300 --> 03:09:22,700
this will be loading
the file into memory,

4541
03:09:23,000 --> 03:09:26,300
but not from your HDFS. Instead,

4542
03:09:26,300 --> 03:09:29,100
what it does is load it

4543
03:09:29,100 --> 03:09:33,000
from your local
file system.

4544
03:09:33,200 --> 03:09:34,921
So that is the difference.

4545
03:09:34,921 --> 03:09:37,600
So as you can see
in the second case,

4546
03:09:37,600 --> 03:09:41,600
I am not even using my HDFS.

4547
03:09:41,700 --> 03:09:43,000
Which means what now?

4548
03:09:43,000 --> 03:09:46,000
Can you tell me why this
error? This is interesting.

4549
03:09:46,000 --> 03:09:49,300
Why this error, "Input path
does not exist"?

4550
03:09:49,300 --> 03:09:51,600
Because I made
a typo here.

4551
03:09:51,600 --> 03:09:52,400
Okay.

4552
03:09:52,400 --> 03:09:53,595
Now if you notice

4553
03:09:53,595 --> 03:09:58,555
why did I not get this error here?
Why did I not get it earlier?

4554
03:09:58,555 --> 03:10:00,200
This file does not exist,

4555
03:10:00,200 --> 03:10:02,500
but still I did not get

4556
03:10:02,500 --> 03:10:07,300
any error, because of
lazy evaluation.

4557
03:10:07,300 --> 03:10:11,500
Lazy evaluation
made sure that, even

4558
03:10:11,500 --> 03:10:14,400
if you have given
the wrong path, it created

4559
03:10:14,400 --> 03:10:18,200
an RDD, but it
has not executed anything.

4560
03:10:18,400 --> 03:10:19,900
So all the output

4561
03:10:19,900 --> 03:10:22,800
or the error
you receive only

4562
03:10:22,800 --> 03:10:25,600
when you hit the action,
collect. Now,

4563
03:10:25,600 --> 03:10:27,997
in order to correct this value,

4564
03:10:27,997 --> 03:10:32,890
I need to correct this to edureka,
and this time if I execute it,

4565
03:10:32,975 --> 03:10:33,975
it will work.

4566
03:10:34,050 --> 03:10:37,050
Okay, you can see
this output 1 2 3 4 5.
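
The local-path read and the lazy evaluation just shown can be put together in one small standalone program. This is a sketch under assumptions, not the session's code: it needs Spark on the classpath, it creates its own SparkContext in local mode (inside spark-shell that context already exists as sc), and the object name, the temporary file and the wrong path are all made up for the example.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; only textFile, collect and the file: prefix come from the session.
object LazyReadDemo {
  def run(): (Seq[String], Boolean) = {
    val conf = new SparkConf().setAppName("LazyReadDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for a.txt: a temporary local file holding the lines 1 to 5.
      val file = Files.createTempFile("a", ".txt")
      Files.write(file, "1\n2\n3\n4\n5\n".getBytes("UTF-8"))

      // A path with the file: scheme is read from the local file system.
      // A path without a scheme goes to the default file system (HDFS on the session's VM).
      val number = sc.textFile(file.toUri.toString)
      val lines = number.collect().toSeq   // action: the file is read here

      // Transformation only: the path is wrong, but nothing is executed yet, so no error.
      val broken = sc.textFile("file:///no/such/dir/a.txt")
      // Action: only now does Spark look for the file and fail with "Input path does not exist".
      val failedOnBadPath = Try(broken.collect()).isFailure

      (lines, failedOnBadPath)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (lines, failedOnBadPath) = run()
    println(lines.mkString(" "))
    println(s"collect on the wrong path failed: $failedOnBadPath")
  }
}
```

The point of the second half is that the error is tied to collect, not to textFile, which matches what happened with the typo in the path.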

4567
03:10:37,100 --> 03:10:40,500
So this time it works
fine. So now we should be

4568
03:10:40,500 --> 03:10:44,200
clearer about lazy
evaluation: even

4569
03:10:44,200 --> 03:10:46,375
if you are giving
the wrong file name

4570
03:10:46,375 --> 03:10:47,628
it doesn't matter. Suppose

4571
03:10:47,628 --> 03:10:49,804
I want to use Spark
in production,

4572
03:10:49,804 --> 03:10:51,155
but not on top of Hadoop.

4573
03:10:51,155 --> 03:10:52,007
Is it possible?

4574
03:10:52,007 --> 03:10:53,200
Yes, you can do that.

4575
03:10:53,200 --> 03:10:54,500
You can do that Sonny,

4576
03:10:54,500 --> 03:10:56,900
but usually that's
not what you do.

4577
03:10:56,900 --> 03:10:58,958
But yes, if you
want to, you can do that.

4578
03:10:58,958 --> 03:11:00,299
there are a lot of things

4579
03:11:00,299 --> 03:11:02,239
which you can do. You
can also deploy it

4580
03:11:02,239 --> 03:11:05,611
on your Amazon clusters as well.
There are a lot of things you can do.

4581
03:11:05,611 --> 03:11:07,900
How will it provide
distribution in that case?

4582
03:11:07,900 --> 03:11:10,186
We'll be using
some other distribution system.

4583
03:11:10,186 --> 03:11:12,425
So in that case you
are not using HDFS;

4584
03:11:12,425 --> 03:11:14,300
you can deploy it,
but it will be just that:

4585
03:11:14,300 --> 03:11:16,400
you will not be able
to kind of go across

4586
03:11:16,400 --> 03:11:17,698
and distribute in the cluster.

4587
03:11:17,698 --> 03:11:19,849
You will not be able to leverage
that redundancy,

4588
03:11:19,849 --> 03:11:22,500
but you can use Amazon
S3 for that.

4589
03:11:22,500 --> 03:11:23,700
Okay, so that is

4590
03:11:23,700 --> 03:11:28,089
how you will be using this.
So this is

4591
03:11:28,089 --> 03:11:31,600
how you will be performing
your practicals in the sessions,

4592
03:11:31,600 --> 03:11:33,643
how you will be working
on Spark.

4593
03:11:33,643 --> 03:11:35,800
I will be training you,
as I told you.

4594
03:11:35,800 --> 03:11:37,500
So this is how things work.

4595
03:11:37,700 --> 03:11:41,600
Now, let us see
an interesting use case.

4596
03:11:41,800 --> 03:11:43,900
So for that let us go back.

4597
03:11:43,900 --> 03:11:47,900
to our slides. This
is going to be very interesting.

4598
03:11:48,161 --> 03:11:50,238
So let's see this use case.

4599
03:11:50,600 --> 03:11:51,600
Look at this.

4600
03:11:51,900 --> 03:11:53,500
This is very interesting.

4601
03:11:53,500 --> 03:11:57,600
Now this use case is for
earthquake detection using Spark.

4602
03:11:57,600 --> 03:12:00,200
So in Japan you
might have already seen

4603
03:12:00,200 --> 03:12:02,450
that there are so many
earthquakes. You

4604
03:12:02,450 --> 03:12:03,800
might have heard about it.

4605
03:12:03,800 --> 03:12:05,591
Definitely, you
might not have seen it,

4606
03:12:05,591 --> 03:12:07,100
but you must have heard about it

4607
03:12:07,100 --> 03:12:09,200
that there are
so many earthquakes

4608
03:12:09,200 --> 03:12:13,700
that happen in Japan. Now,
how do we solve that problem with

4609
03:12:13,700 --> 03:12:16,111
Spark? I'm just going
to give you a glimpse

4610
03:12:16,111 --> 03:12:17,400
of what kind of problems

4611
03:12:17,400 --> 03:12:18,563
we solve in the sessions.

4612
03:12:18,563 --> 03:12:21,600
Definitely, we are not going to
walk through this in detail,

4613
03:12:21,600 --> 03:12:24,500
but you will get an idea
of how Spark is used.

4614
03:12:24,500 --> 03:12:27,300
Okay, just to give you
a little bit of brief here.

4615
03:12:27,300 --> 03:12:30,500
But all these projects
you will learn at the time

4616
03:12:30,500 --> 03:12:31,900
of the sessions. Now,

4617
03:12:32,000 --> 03:12:35,300
let's see
how we will be using this.

4618
03:12:35,300 --> 03:12:38,500
So everybody must know
what an earthquake is.

4619
03:12:38,500 --> 03:12:39,800
An earthquake is

4620
03:12:40,200 --> 03:12:44,028
like a shaking of the surface
of the Earth,

4621
03:12:44,028 --> 03:12:46,900
due to events
that happen in tectonic plates.

4622
03:12:46,900 --> 03:12:48,050
If you're from India,

4623
03:12:48,050 --> 03:12:51,400
you might have seen recently
there was an earthquake incident

4624
03:12:51,400 --> 03:12:54,600
that came from Nepal;
even recently, two days back,

4625
03:12:54,600 --> 03:12:56,900
there was also such an incident.

4626
03:12:57,053 --> 03:12:59,900
So these earthquakes
keep coming. Now, a

4627
03:12:59,900 --> 03:13:02,300
very important part is let's say

4628
03:13:02,300 --> 03:13:06,100
if the earthquake is
a major earthquake,

4629
03:13:06,100 --> 03:13:08,992
or maybe tsunami
maybe forest fires,

4630
03:13:08,992 --> 03:13:10,600
maybe a volcano now,

4631
03:13:10,600 --> 03:13:14,000
it's very important
for them to kind of see

4632
03:13:15,100 --> 03:13:19,600
that an earthquake is going to come.
They should be able to kind

4633
03:13:19,600 --> 03:13:21,600
of predict it beforehand.

4634
03:13:21,600 --> 03:13:23,776
It should not happen
that at the last moment

4635
03:13:23,776 --> 03:13:25,254
they get to know that, okay,

4636
03:13:25,254 --> 03:13:27,862
an earthquake is coming.
No,

4637
03:13:27,862 --> 03:13:29,700
it should not happen like that.

4638
03:13:29,700 --> 03:13:34,000
They should be able to estimate
all these things beforehand.

4639
03:13:34,000 --> 03:13:36,600
They should be able
to predict beforehand.

4640
03:13:36,688 --> 03:13:40,611
So this is the system
that Japan is already using.

4641
03:13:40,700 --> 03:13:44,300
So this is a real-time kind of
use case that I am presenting.

4642
03:13:44,300 --> 03:13:47,300
So Japan is already
using Spark

4643
03:13:47,300 --> 03:13:49,770
in order to solve
this earthquake problem.

4644
03:13:49,770 --> 03:13:52,482
We are going to see
how they're using it.

4645
03:13:52,482 --> 03:13:52,866
Okay.

4646
03:13:52,900 --> 03:13:56,900
Now let's see what happens
in the Japan earthquake model.

4647
03:13:57,000 --> 03:14:00,000
So whenever there is
an earthquake coming

4648
03:14:00,000 --> 03:14:02,000
for example at 2:46 p.m.

4649
03:14:02,000 --> 03:14:04,800
on March 11, 2011,

4650
03:14:04,800 --> 03:14:08,300
the Japan earthquake early
warning system detected it.

4651
03:14:08,600 --> 03:14:12,800
Now, as
soon as it was detected, immediately

4652
03:14:12,800 --> 03:14:16,999
they started sending
alerts to the schools, to the lifts,

4653
03:14:17,000 --> 03:14:20,700
to the factories, to every station,
through TV stations.

4654
03:14:20,700 --> 03:14:23,300
They immediately kind
of told everyone

4655
03:14:23,300 --> 03:14:26,315
so that all the students
who were in school

4656
03:14:26,315 --> 03:14:29,800
got time to go
under their desks. Bullet trains

4657
03:14:29,800 --> 03:14:30,900
that were running

4658
03:14:30,900 --> 03:14:31,571
were stopped.

4659
03:14:31,571 --> 03:14:35,200
Otherwise, when the
earth starts shaking,

4660
03:14:35,200 --> 03:14:38,200
the bullet trains are already
running at the very high speed.

4661
03:14:38,200 --> 03:14:39,432
They want to ensure

4662
03:14:39,432 --> 03:14:43,000
that there should be no sort
of casualty because of that

4663
03:14:43,000 --> 03:14:46,600
So all the bullet trains stopped.
All the elevators, the lifts

4664
03:14:46,600 --> 03:14:47,825
that were running

4665
03:14:47,825 --> 03:14:50,600
were stopped; otherwise
some incident could happen.

4666
03:14:50,700 --> 03:14:53,930
In 60 seconds, 60 seconds

4667
03:14:53,930 --> 03:14:55,700
before the earthquake, they

4668
03:14:55,700 --> 03:14:59,100
were able to inform
almost everyone.

4669
03:14:59,300 --> 03:15:01,212
They sent the message.

4670
03:15:01,212 --> 03:15:02,698
They broadcast

4671
03:15:02,698 --> 03:15:05,949
on TV; all those things
they did immediately

4672
03:15:05,949 --> 03:15:07,100
to all the people

4673
03:15:07,100 --> 03:15:09,856
so that they can send
at least this message

4674
03:15:09,856 --> 03:15:11,300
whoever can receive it

4675
03:15:11,300 --> 03:15:13,600
and that saved millions

4676
03:15:13,600 --> 03:15:17,300
of lives. That is what they
were able to achieve,

4677
03:15:17,300 --> 03:15:22,100
and they did all this
with the help of Apache Spark.

4678
03:15:22,192 --> 03:15:24,500
That is the most important point.

4679
03:15:24,500 --> 03:15:27,900
How did they do it? You
can see that everything

4680
03:15:27,900 --> 03:15:29,800
they are doing there,

4681
03:15:29,800 --> 03:15:33,600
they are doing
on a real-time system, right?

4682
03:15:33,700 --> 03:15:35,690
They cannot just
collect the data

4683
03:15:35,690 --> 03:15:39,100
and process it later;
they did everything as

4684
03:15:39,100 --> 03:15:40,300
a real-time system.

4685
03:15:40,300 --> 03:15:43,300
So they collected the data,
immediately processed it,

4686
03:15:43,300 --> 03:15:45,004
and as soon as they detected

4687
03:15:45,004 --> 03:15:47,484
the earthquake, they
immediately informed people.

4688
03:15:47,484 --> 03:15:49,381
In fact, this happened in 2011.

4689
03:15:49,381 --> 03:15:52,100
Now they
use it very frequently,

4690
03:15:52,100 --> 03:15:54,318
because Japan is one of the areas

4691
03:15:54,318 --> 03:15:58,200
that is very frequently
affected by all this.

4692
03:15:58,200 --> 03:15:58,900
So as I said,

4693
03:15:58,900 --> 03:16:01,548
the main thing is we should be
able to process the data

4694
03:16:01,548 --> 03:16:02,449
and, on top of that,

4695
03:16:02,449 --> 03:16:04,900
the bigger thing: you
should be able to handle

4696
03:16:04,900 --> 03:16:06,400
the data from multiple sources

4697
03:16:06,400 --> 03:16:07,789
because data may be coming

4698
03:16:07,789 --> 03:16:10,882
from multiple sources;
different sources

4699
03:16:10,882 --> 03:16:13,600
might be suggesting one
or another event

4700
03:16:13,600 --> 03:16:16,305
because of which we
are predicting that, okay,

4701
03:16:16,305 --> 03:16:17,770
this earthquake can happen.

4702
03:16:17,770 --> 03:16:19,729
It should be very
easy to use because

4703
03:16:19,729 --> 03:16:22,500
if it is very complicated
then in that case

4704
03:16:22,500 --> 03:16:23,500
for a user to use it

4705
03:16:23,500 --> 03:16:25,549
will become
very difficult.

4706
03:16:25,549 --> 03:16:27,600
You will not be able
to solve the problem.

4707
03:16:27,700 --> 03:16:29,200
And even at the end,

4708
03:16:29,200 --> 03:16:32,100
how to send the alert
message is important.

4709
03:16:32,100 --> 03:16:32,900
Okay.

4710
03:16:32,900 --> 03:16:36,000
So all those things
are taken care of by Spark.

4711
03:16:36,000 --> 03:16:39,923
Now there are two kinds
of waves in an earthquake.

4712
03:16:40,100 --> 03:16:42,633
Number one
is the primary wave

4713
03:16:42,633 --> 03:16:43,900
and the second is the secondary

4714
03:16:43,900 --> 03:16:44,864
wave.

4715
03:16:44,864 --> 03:16:46,600
There are two kinds of waves

4716
03:16:46,600 --> 03:16:49,100
in an earthquake.
The primary wave is

4717
03:16:49,100 --> 03:16:52,261
when the earthquake is
just about to start; it starts

4718
03:16:52,261 --> 03:16:53,400
from the epicenter

4719
03:16:53,400 --> 03:16:55,200
and signals that the earthquake

4720
03:16:55,200 --> 03:16:59,100
is going to start. The secondary wave
is more severe than the primary one,

4721
03:16:59,100 --> 03:17:01,400
and it starts after the primary wave.

4722
03:17:01,400 --> 03:17:03,912
Now what happens
in the secondary wave is,

4723
03:17:03,912 --> 03:17:06,900
when it starts, it
can do maximum damage,

4724
03:17:06,900 --> 03:17:09,605
because the primary wave, you
can say, is the initial wave,

4725
03:17:09,605 --> 03:17:11,900
but the secondary wave
will be on top of that.

4726
03:17:11,900 --> 03:17:14,800
So there will be some details
with respect to that; I'm not going

4727
03:17:14,800 --> 03:17:15,800
into detail on that.

4728
03:17:15,800 --> 03:17:17,600
But yeah, there
will be some details

4729
03:17:17,600 --> 03:17:18,700
with respect to that.

4730
03:17:18,700 --> 03:17:21,700
Now what are we going
to do using Spark?

4731
03:17:21,700 --> 03:17:23,907
We will be creating our ROC.

4732
03:17:23,907 --> 03:17:26,799
So let's go and see
that in our machine

4733
03:17:26,799 --> 03:17:30,600
how we will be
calculating our ROC, using

4734
03:17:30,600 --> 03:17:33,600
which we will be solving
this problem later

4735
03:17:33,600 --> 03:17:36,524
and we will be calculating
this ROC with the help

4736
03:17:36,524 --> 03:17:37,500
of Apache Spark.
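
The session keeps the project code as a black box and does not show how the ROC is computed. Purely as an illustration of what an ROC calculation involves, here is a plain-Scala sketch without Spark. It is an assumption-laden example, not the project's method: the object name, the function names and the sample scores are all made up, and labels are assumed to be 1 (event) or 0 (no event).

```scala
// Hypothetical sketch: ROC points and area under the curve from (score, label) pairs.
object RocSketch {
  // Returns (falsePositiveRate, truePositiveRate) points, starting at (0, 0).
  def rocPoints(scored: Seq[(Double, Int)]): Seq[(Double, Double)] = {
    val positives = scored.count(_._2 == 1).toDouble
    val negatives = scored.size - positives
    require(positives > 0 && negatives > 0, "need at least one example of each class")
    // Lower the decision threshold one distinct score at a time, highest score first.
    val groups = scored.groupBy(_._1).toSeq.sortBy { case (score, _) => -score }
    // Running counts of false positives and true positives as the threshold drops.
    val counts = groups.scanLeft((0.0, 0.0)) { case ((fp, tp), (_, group)) =>
      (fp + group.count(_._2 == 0), tp + group.count(_._2 == 1))
    }
    counts.map { case (fp, tp) => (fp / negatives, tp / positives) }
  }

  // Area under the ROC curve by the trapezoidal rule.
  def auc(points: Seq[(Double, Double)]): Double =
    points.sliding(2).collect { case Seq((x1, y1), (x2, y2)) =>
      (x2 - x1) * (y1 + y2) / 2
    }.sum

  def main(args: Array[String]): Unit = {
    // Made-up scores: a higher score means the model is more confident in label 1.
    val scored = Seq((0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0))
    println(s"AUC = ${auc(rocPoints(scored))}")
  }
}
```

An area close to 1 means the scores rank real events above non-events almost every time; 0.5 is no better than guessing.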

4737
03:17:37,500 --> 03:17:39,729
Let us again come back
to this machine now

4738
03:17:39,729 --> 03:17:41,369
in order to work on that.

4739
03:17:41,369 --> 03:17:43,600
Let's first exit
from this console.

4740
03:17:43,800 --> 03:17:48,300
Once you exit from this console
now what you're going to do.

4741
03:17:48,300 --> 03:17:51,900
I have already created
this project and kept it here,

4742
03:17:51,900 --> 03:17:55,563
because we just want to give
you an overview of this.

4743
03:17:55,563 --> 03:17:57,900
Let me go to
my downloads section.

4744
03:17:57,900 --> 03:18:01,400
There is a project called
Earth2. So this is

4745
03:18:01,400 --> 03:18:03,400
your project. Initially,

4746
03:18:03,500 --> 03:18:06,400
what will you have? You

4747
03:18:06,400 --> 03:18:08,839
will not have all
these things in the initial part.

4748
03:18:08,839 --> 03:18:09,900
So what will happen.

4749
03:18:09,900 --> 03:18:12,990
So let's say if I go
to my downloads from here,

4750
03:18:12,990 --> 03:18:14,200
I have the Earth2

4751
03:18:14,200 --> 03:18:16,800
project. Okay.

4752
03:18:16,800 --> 03:18:19,000
Now initially I
will not be having

4753
03:18:19,000 --> 03:18:22,300
this target directory, project
directory, bin directory.

4754
03:18:22,300 --> 03:18:25,400
We will be using
our SBT framework.

4755
03:18:25,400 --> 03:18:28,900
If you do not know SBT, it
is the Scala build tool,

4756
03:18:28,900 --> 03:18:32,400
which takes care of all
your dependencies.

4757
03:18:32,400 --> 03:18:36,700
So it is very
similar to Maven.

4758
03:18:36,700 --> 03:18:40,577
If you already know Maven,
this is very similar,

4759
03:18:40,577 --> 03:18:42,900
but at the same time
I prefer SBT,

4760
03:18:42,900 --> 03:18:46,100
because SBT is
easier to write in comparison

4761
03:18:46,100 --> 03:18:47,700
to Maven.

4762
03:18:47,700 --> 03:18:50,700
So you will be writing
this build.sbt.

4763
03:18:50,700 --> 03:18:55,800
So this file will be your
build.sbt. Now, in this file,

4764
03:18:55,800 --> 03:18:57,255
you will be giving the name

4765
03:18:57,255 --> 03:18:59,700
of your project, its
version, the

4766
03:18:59,700 --> 03:19:02,800
version of Scala
you are using,

4767
03:19:02,800 --> 03:19:05,385
what dependencies
you have, with

4768
03:19:05,385 --> 03:19:09,400
what versions.
For Spark Core, I am using

4769
03:19:09,400 --> 03:19:11,194
version 1.5.2 of Spark.

4770
03:19:11,200 --> 03:19:15,100
So you are telling it
that, whatever in my program

4771
03:19:15,150 --> 03:19:16,150
I am writing,

4772
03:19:16,200 --> 03:19:22,100
if I require anything related
to Spark Core, go and get it

4773
03:19:22,100 --> 03:19:27,400
from this Apache
website, download it, install it.

4774
03:19:27,800 --> 03:19:29,900
If I require any dependency

4775
03:19:29,900 --> 03:19:34,700
for spark streaming program for
this particular version 1.5.2.

4776
03:19:35,000 --> 03:19:37,700
Go to this website or this link

4777
03:19:37,700 --> 03:19:41,200
and executed similar theme
for Amanda password.

4778
03:19:41,200 --> 03:19:43,353
So you just telling them now

4779
03:19:43,400 --> 03:19:47,200
once you have done this you will
be creating a Folder structure.

4780
03:19:47,200 --> 03:19:49,200
Your folder structure
would be you need

4781
03:19:49,200 --> 03:19:50,722
to create an src folder.

4782
03:19:50,722 --> 03:19:51,393
After that,

4783
03:19:51,393 --> 03:19:54,612
you will be creating
a main folder. Inside the main folder,

4784
03:19:54,612 --> 03:19:57,200
you will again
create a folder called

4785
03:19:57,200 --> 03:19:58,800
scala. Now, inside

4786
03:19:58,800 --> 03:20:01,100
that you will be
keeping your program.

4787
03:20:01,100 --> 03:20:03,300
So now here you will
be writing a program.

4788
03:20:03,300 --> 03:20:04,500
So you are writing it.

4789
03:20:04,500 --> 03:20:07,499
Can you see these
files, such as the streaming one,

4790
03:20:07,499 --> 03:20:08,500
with the .scala extension?

4791
03:20:08,500 --> 03:20:10,623
So let's keep it as
a black box for now.

4792
03:20:10,623 --> 03:20:12,730
So you will be writing
the code to achieve

4793
03:20:12,730 --> 03:20:14,083
this problem statement.

4794
03:20:14,083 --> 03:20:15,500
Now what we are going to do

4795
03:20:15,500 --> 03:20:20,200
is come out of this to
your main project folder,

4796
03:20:20,400 --> 03:20:21,500
and from here

4797
03:20:21,700 --> 03:20:24,400
we will be running sbt package.

4798
03:20:24,500 --> 03:20:26,400
It will start downloading

4799
03:20:26,400 --> 03:20:29,700
with respect to your sbt file;
it will check your program.

4800
03:20:29,700 --> 03:20:31,900
Whatever dependency you require

4801
03:20:31,900 --> 03:20:35,750
for Spark Core, Spark
Streaming, Spark MLlib.

4802
03:20:35,750 --> 03:20:36,895
It will download

4803
03:20:36,895 --> 03:20:39,400
and install it; it
will just download

4804
03:20:39,400 --> 03:20:42,200
and install it so we
are not going to execute it

4805
03:20:42,200 --> 03:20:43,900
because I've already
done it before

4806
03:20:43,900 --> 03:20:45,300
and it also takes some time.

4807
03:20:45,300 --> 03:20:48,453
So that's the reason
I'm not doing it now.

4808
03:20:48,500 --> 03:20:50,689
Once you have run this package,

4809
03:20:50,700 --> 03:20:53,788
you will find all
these directories: the target directory

4810
03:20:53,788 --> 03:20:55,400
and the project directory.

4811
03:20:55,400 --> 03:20:58,100
These got created
later on. Now,

4812
03:20:58,100 --> 03:20:59,600
what is going to happen.

4813
03:20:59,600 --> 03:21:03,400
Once you have created this
you will go to your Eclipse.

4814
03:21:03,400 --> 03:21:04,900
So your Eclipse will open.

4815
03:21:04,900 --> 03:21:06,600
So let me open my Eclipse.

4816
03:21:06,900 --> 03:21:08,995
So this is how your
Eclipse project looks.

4817
03:21:08,995 --> 03:21:09,200
Now.

4818
03:21:09,200 --> 03:21:11,300
I already have this program
in front of me,

4819
03:21:11,300 --> 03:21:14,900
but let me tell you how you
will be bringing this program.

4820
03:21:14,900 --> 03:21:17,800
You will be going
to your import option

4821
03:21:17,800 --> 03:21:18,934
with import. In import, you

4822
03:21:18,934 --> 03:21:22,400
will be selecting your existing
projects into workspace.

4823
03:21:22,400 --> 03:21:23,700
Next once you do

4824
03:21:23,700 --> 03:21:26,400
that you need to select
your main project.

4825
03:21:26,400 --> 03:21:29,000
For example, you need
to select this Earth to project

4826
03:21:29,000 --> 03:21:31,900
what you have created
and click on OK

4827
03:21:31,900 --> 03:21:32,709
once you do

4828
03:21:32,709 --> 03:21:35,872
that, there will be
a project directory coming

4829
03:21:35,872 --> 03:21:38,300
from this Earth
to will come here.

4830
03:21:38,300 --> 03:21:41,700
Now what we need to do: go
to your src/main

4831
03:21:41,700 --> 03:21:43,628
and ignore all these programs.

4832
03:21:43,628 --> 03:21:46,400
I require only just this one Scala file,
because this is

4833
03:21:46,400 --> 03:21:48,500
where I've written
my main function.

4834
03:21:48,500 --> 03:21:50,260
Important now after that

4835
03:21:50,260 --> 03:21:52,900
once you reach
to this you need to go

4836
03:21:52,900 --> 03:21:55,900
to Run As, Scala Application,

4837
03:21:56,100 --> 03:21:59,600
and your Spark code
will start to execute. Now,

4838
03:21:59,600 --> 03:22:01,800
this will return me an ROC.

4839
03:22:02,000 --> 03:22:02,314
Okay.

4840
03:22:02,314 --> 03:22:03,700
Let's see this output.

4841
03:22:06,600 --> 03:22:08,200
Now if I see this,

4842
03:22:08,200 --> 03:22:11,800
this will show me
once it's finished executing.

4843
03:22:22,900 --> 03:22:26,300
See this: our area
under ROC is this.

4844
03:22:26,300 --> 03:22:29,107
So this is all computed
with the help of the Spark program.

4845
03:22:29,107 --> 03:22:29,695
Similarly,

4846
03:22:29,695 --> 03:22:32,100
there are other programs
also that will help you

4847
03:22:32,100 --> 03:22:33,400
to spin the data or not.

4848
03:22:33,509 --> 03:22:35,010
I'm not walking over all that.

4849
03:22:35,160 --> 03:22:39,000
Now, let's come back
to my slides and see

4850
03:22:39,000 --> 03:22:40,900
what the next step is,

4851
03:22:40,900 --> 03:22:44,500
what we will be doing. So you
can see what will be next.

4852
03:22:44,500 --> 03:22:48,200
The ROC is getting created. Now,
I'm keeping my ROC here.

4853
03:22:48,200 --> 03:22:53,100
Now, after you have created
your ROC, you will be generating a graph.

4854
03:22:53,100 --> 03:22:56,200
Now, in Japan there is
one important thing.

4855
03:22:56,200 --> 03:22:59,771
Japan is already
an area affected by earthquakes.

4856
03:22:59,771 --> 03:23:01,714
And now the trouble here is

4857
03:23:01,714 --> 03:23:05,600
that it's not that, even
for a minor earthquake,

4858
03:23:05,600 --> 03:23:07,852
I should start sending
the alert right?

4859
03:23:07,852 --> 03:23:11,300
I don't want to do all that
for the very minor ones.

4860
03:23:11,300 --> 03:23:14,100
In fact, the buildings
and the infrastructure

4861
03:23:14,100 --> 03:23:17,300
that are created in Japan
are built in such a way that

4862
03:23:17,300 --> 03:23:18,600
if any earthquake

4863
03:23:18,600 --> 03:23:21,700
below magnitude six
comes there,

4864
03:23:22,000 --> 03:23:25,713
the homes are designed in a way
that there will be no damage.

4865
03:23:25,713 --> 03:23:27,400
There will be no damage to them.

4866
03:23:27,400 --> 03:23:29,400
So this is the major thing

4867
03:23:29,400 --> 03:23:33,300
when you work with the Japan
earthquake data. Now in Japan,

4868
03:23:33,300 --> 03:23:36,000
that means below six
they are not even worried,

4869
03:23:36,000 --> 03:23:37,300
but above six they

4870
03:23:37,300 --> 03:23:40,668
are worried. Now, for that there
will be a graph simulation,

4871
03:23:40,668 --> 03:23:43,600
which you can do
with Spark as well.

4872
03:23:43,600 --> 03:23:47,800
Once you generate this graph you
will be seeing that anything

4873
03:23:47,800 --> 03:23:49,449
which is going above 6

4874
03:23:49,449 --> 03:23:52,000
if anything which
is going above 6,

4875
03:23:52,000 --> 03:23:55,400
should immediately start
the alert. Now ignore all

4876
03:23:55,400 --> 03:23:56,700
this programming side

4877
03:23:56,700 --> 03:23:59,800
because that is what we
have just created and showing

4878
03:23:59,800 --> 03:24:01,411
you this execution part. Now,

4879
03:24:01,411 --> 03:24:03,800
if you have to visualize
the same result,

4880
03:24:03,800 --> 03:24:05,200
this is what is happening.

4881
03:24:05,200 --> 03:24:07,300
This is showing my ROC, but

4882
03:24:07,300 --> 03:24:11,800
if my earthquake is going
to be greater than 6, then only

4883
03:24:11,800 --> 03:24:16,415
raise those alerts, then only
send the alert to all the people.

4884
03:24:16,415 --> 03:24:18,400
Otherwise, stay calm.
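
The alerting rule described here can be sketched in a few lines of Python. This is only an illustration of the "greater than 6" rule from the talk; the function name, the constant, and the sample readings are made up for the example and are not part of the project shown on screen.

```python
# Hypothetical sketch of the alert rule: raise an alert only above magnitude 6.
ALERT_THRESHOLD = 6.0  # threshold taken from the talk ("greater than 6")


def should_alert(magnitude, threshold=ALERT_THRESHOLD):
    """Return True only when the magnitude is strictly above the threshold."""
    return magnitude > threshold


# Made-up magnitudes, purely for illustration.
readings = [4.2, 5.9, 6.0, 6.4, 7.1]
alerts = [m for m in readings if should_alert(m)]
print(alerts)
```

A reading of exactly 6.0 does not raise an alert in this sketch, because the rule is "greater than 6".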

4885
03:24:18,600 --> 03:24:22,000
That is the project
that we generally show

4886
03:24:22,000 --> 03:24:25,563
in our Spark program sessions. Now,
it is not the only project;

4887
03:24:25,563 --> 03:24:28,900
we also kind of create
multiple other projects as well.

4888
03:24:28,900 --> 03:24:31,600
For example, I kind
of create a model just

4889
03:24:31,600 --> 03:24:33,204
like how Walmart does it,

4890
03:24:33,204 --> 03:24:35,100
how Walmart may be tracking

4891
03:24:35,100 --> 03:24:38,241
whatever sales are happening.
With respect to that,

4892
03:24:38,241 --> 03:24:39,743
they're using Apache Spark,

4893
03:24:39,743 --> 03:24:43,000
and at the end they are kind of
making you visualize the output

4894
03:24:43,000 --> 03:24:45,400
of doing whatever
analytics they're doing.

4895
03:24:45,400 --> 03:24:46,900
So that is all done in Spark.

4896
03:24:46,900 --> 03:24:48,900
So all those things
we will be walking through

4897
03:24:48,900 --> 03:24:52,252
when we do the sessions; all
these things you will learn.

4898
03:24:52,252 --> 03:24:55,100
I feel that all these projects
are using right now,

4899
03:24:55,100 --> 03:24:56,700
since you do not know the topic

4900
03:24:56,700 --> 03:24:59,400
you are not able to get
hundred percent of the project.

4901
03:24:59,400 --> 03:25:00,434
But at that time

4902
03:25:00,434 --> 03:25:03,366
once you know each
and every topic in detail,

4903
03:25:03,366 --> 03:25:07,100
you will have a clearer picture
of how spark is handling.

4904
03:25:07,100 --> 03:25:15,000
all these use cases. Graphs
are very attractive

4905
03:25:15,000 --> 03:25:17,900
when it comes to modeling
real world data

4906
03:25:17,900 --> 03:25:19,900
because they are
intuitive and flexible,

4907
03:25:19,900 --> 03:25:23,100
and the theory supporting
them has been maturing

4908
03:25:23,100 --> 03:25:25,209
for centuries. Welcome, everyone,

4909
03:25:25,209 --> 03:25:27,600
to today's session
on Spark GraphX.

4910
03:25:27,700 --> 03:25:30,700
So without any further delay,
let's look at the agenda first.

4911
03:25:31,500 --> 03:25:34,561
We'll start by understanding
the basics of graph theory

4912
03:25:34,561 --> 03:25:36,229
and different types of graphs.

4913
03:25:36,229 --> 03:25:38,806
Then we'll look
at the features of GraphX.

4914
03:25:38,806 --> 03:25:40,170
Further, we'll understand

4915
03:25:40,170 --> 03:25:43,820
what a property graph is and look
at various graph operations.

4916
03:25:43,820 --> 03:25:44,594
Moving ahead,

4917
03:25:44,594 --> 03:25:48,258
we'll look at different graph
processing algorithms. At last,

4918
03:25:48,258 --> 03:25:49,500
we'll look at a demo

4919
03:25:49,500 --> 03:25:52,400
where we will try
to analyze Ford GoBike

4920
03:25:52,400 --> 03:25:54,700
data using the PageRank algorithm.

4921
03:25:54,700 --> 03:25:56,800
Let's move to the first topic.

4922
03:25:57,200 --> 03:25:59,845
So we'll start
with the basics of graphs.

4923
03:25:59,845 --> 03:26:03,661
So graphs are basically
made up of two sets called

4924
03:26:03,661 --> 03:26:05,089
vertices and edges.

4925
03:26:05,089 --> 03:26:08,704
The vertices are drawn
from some underlying type

4926
03:26:08,704 --> 03:26:11,550
and the set can be
finite or infinite.

4927
03:26:11,550 --> 03:26:12,900
Now each element

4928
03:26:12,900 --> 03:26:17,035
of the edge set is a pair
consisting of two elements

4929
03:26:17,035 --> 03:26:18,728
from the vertices set.

4930
03:26:18,900 --> 03:26:21,400
So your vertex is V1.

4931
03:26:21,403 --> 03:26:23,173
Then your vertex is V3.

4932
03:26:23,173 --> 03:26:25,480
Then your vertex is V2 and V4.

4933
03:26:25,700 --> 03:26:29,300
And your edges are
V1 comma V3; then next

4934
03:26:29,300 --> 03:26:33,500
is V1 comma V2; then
you have V2 comma V3;

4935
03:26:33,500 --> 03:26:34,961
and then you have

4936
03:26:34,961 --> 03:26:38,807
V2 comma V4. So basically
we represent the vertex set

4937
03:26:38,807 --> 03:26:43,000
by enclosing in curly braces
the names of all the vertices.

4938
03:26:43,100 --> 03:26:45,561
So we have V1, we have V2,

4939
03:26:45,561 --> 03:26:48,176
we have V3, and then
we have V4,

4940
03:26:48,300 --> 03:26:53,073
and we'll close the curly braces.
And to represent the edge set,

4941
03:26:53,073 --> 03:26:56,600
we use curly braces again,
and then in the curly braces

4942
03:26:56,600 --> 03:27:00,907
we specify the two vertices
that are joined by each edge.

4943
03:27:01,000 --> 03:27:02,600
So for this Edge,

4944
03:27:02,600 --> 03:27:07,700
we will use V1 comma
V3, and then for this edge

4945
03:27:07,700 --> 03:27:12,700
we'll use V1 comma
V2, and then for this edge again,

4946
03:27:12,700 --> 03:27:15,000
we'll use V2 comma V4.

4947
03:27:16,088 --> 03:27:19,011
And then at last,
for this edge we'll use

4948
03:27:19,300 --> 03:27:23,700
V2 comma V3, and at last I
will close the curly braces.

4949
03:27:24,100 --> 03:27:26,400
So this is your vertex set.

4950
03:27:26,500 --> 03:27:28,900
And this is your edge set.

4951
03:27:29,400 --> 03:27:31,958
Now, one very
important thing:

4952
03:27:31,958 --> 03:27:35,476
if the edge set contains
U comma V, or you can say

4953
03:27:35,476 --> 03:27:38,700
that our edge set
contains V1 comma V3,

4954
03:27:38,700 --> 03:27:42,000
then V1 is basically
adjacent to V3.

4955
03:27:42,200 --> 03:27:45,100
Similarly, your
V1 is adjacent to V2.

4956
03:27:45,200 --> 03:27:48,427
Then V2 is adjacent
to V4, and looking at this,

4957
03:27:48,427 --> 03:27:50,900
you can say V2
is adjacent to V3.
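
To make the vertex set, the edge set, and adjacency concrete, here is a minimal Python sketch of the four-vertex example. The vertex names follow the talk; the helper `is_adjacent` is a hypothetical name introduced only for this illustration.

```python
# Vertex set and edge set for the four-vertex example from the talk.
vertices = {"v1", "v2", "v3", "v4"}
edges = {("v1", "v3"), ("v1", "v2"), ("v2", "v4"), ("v2", "v3")}


def is_adjacent(u, v, edge_set):
    """Return True if the pair (u, v) appears in the edge set."""
    return (u, v) in edge_set


# Every edge must be a pair of elements drawn from the vertex set.
assert all(u in vertices and v in vertices for u, v in edges)

print(is_adjacent("v1", "v3", edges))  # the pair (v1, v3) is in the edge set
print(is_adjacent("v3", "v4", edges))  # no such pair in the edge set
```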

4958
03:27:50,900 --> 03:27:53,686
Now, let's quickly move
ahead and we'll look

4959
03:27:53,686 --> 03:27:55,500
at different types of graphs.

4960
03:27:55,500 --> 03:27:58,300
So first we have
undirected graphs.

4961
03:27:58,500 --> 03:28:00,936
So basically in
an undirected graph,

4962
03:28:00,936 --> 03:28:04,000
we use straight lines
to represent the edges.

4963
03:28:04,000 --> 03:28:08,350
Now the order of the vertices
in the edge set does not matter

4964
03:28:08,350 --> 03:28:09,800
in undirected graph.

4965
03:28:09,800 --> 03:28:14,040
So undirected graphs usually
are drawn using straight lines

4966
03:28:14,040 --> 03:28:15,500
between the vertices.

4967
03:28:15,500 --> 03:28:18,300
Now it is almost
similar to the graph

4968
03:28:18,300 --> 03:28:20,763
which we have seen
in the last slide.

4969
03:28:20,763 --> 03:28:21,563
Similarly,

4970
03:28:21,563 --> 03:28:25,000
we can again represent
the vertex set as 5 comma

4971
03:28:25,000 --> 03:28:27,500
6 comma 7 comma 8, and the edge

4972
03:28:27,500 --> 03:28:32,000
set as 5 comma 6, then
5 comma 7. Now, talking

4973
03:28:32,000 --> 03:28:33,643
about directed graphs.

4974
03:28:33,643 --> 03:28:37,605
So basically in a directed graph
the order of vertices

4975
03:28:37,605 --> 03:28:39,400
in the edge set matters.

4976
03:28:39,700 --> 03:28:43,100
So we use arrows
to represent the edges,

4977
03:28:43,300 --> 03:28:45,014
as you can see in the image,

4978
03:28:45,014 --> 03:28:48,000
which was not the case
with the undirected graph,

4979
03:28:48,000 --> 03:28:49,900
where we were using
the straight lines.

4980
03:28:50,000 --> 03:28:51,400
So in directed graph,

4981
03:28:51,400 --> 03:28:56,000
we use arrows to denote the edges,
and the important thing is

4982
03:28:56,000 --> 03:28:58,214
how the edge set is written.

4983
03:28:58,214 --> 03:29:00,500
It will contain
the source vertex,

4984
03:29:00,500 --> 03:29:04,200
that is 5 in this case,
and the destination vertex,

4985
03:29:04,200 --> 03:29:09,400
which is 6 in this case, and this
is never the same as 6 comma

4986
03:29:09,400 --> 03:29:13,300
5. You cannot represent
this edge as 6 comma 5,

4987
03:29:13,400 --> 03:29:17,100
because the direction always
matters in a directed graph.

4988
03:29:17,100 --> 03:29:18,500
Similarly, you can say

4989
03:29:18,500 --> 03:29:20,556
that 5 is adjacent to 6,

4990
03:29:20,556 --> 03:29:23,787
but you cannot say
that 6 is adjacent to 5.

4991
03:29:24,200 --> 03:29:29,000
So for this graph the vertex
set would be the same: 5 comma

4992
03:29:29,000 --> 03:29:32,620
6 comma 7 comma 8,
which is the same as

4993
03:29:32,620 --> 03:29:34,158
in the undirected graph,

4994
03:29:34,200 --> 03:29:38,700
but in a directed graph your edge
set would be: your first tuple,

4995
03:29:38,700 --> 03:29:42,835
this one, will be 5 comma
6; then your second edge,

4996
03:29:42,835 --> 03:29:46,528
which is this one, would be
5 comma 7;

4997
03:29:47,000 --> 03:29:53,300
and at last this edge
would be 7 comma 8. But in the case

4998
03:29:53,300 --> 03:29:56,166
of undirected graph
you can write this as

4999
03:29:56,166 --> 03:29:57,600
8 comma 7 or in case

5000
03:29:57,600 --> 03:30:00,400
of undirected graph you can
write this one as seven comma

5001
03:30:00,400 --> 03:30:03,369
5 but this is not the case
with the directed graph.

5002
03:30:03,369 --> 03:30:05,428
You have to follow
the source vertex

5003
03:30:05,428 --> 03:30:08,100
and the destination vertex
to represent the edge.
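
The difference between the two kinds of edge sets can be sketched in Python as follows. This is just one possible modelling choice, assumed for illustration: an undirected edge as a `frozenset` (order ignored) and a directed edge as a `(source, destination)` tuple (order kept). The helper names are hypothetical.

```python
# Undirected edge: order does not matter, so model it as a frozenset.
undirected_edges = {frozenset({5, 6}), frozenset({5, 7}), frozenset({7, 8})}

# Directed edge: (source, destination), order matters, so model it as a tuple.
directed_edges = {(5, 6), (5, 7), (7, 8)}


def has_undirected_edge(u, v):
    """An undirected edge between u and v exists regardless of order."""
    return frozenset({u, v}) in undirected_edges


def has_directed_edge(src, dst):
    """A directed edge exists only from src to dst, not the reverse."""
    return (src, dst) in directed_edges


print(has_undirected_edge(8, 7))  # same edge as 7 comma 8
print(has_directed_edge(7, 8))    # stored as source 7, destination 8
print(has_directed_edge(8, 7))    # the reverse direction is a different edge
```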

5004
03:30:08,100 --> 03:30:10,642
So I hope you guys are clear
with the undirected

5005
03:30:10,642 --> 03:30:11,846
and directed graph.

5006
03:30:11,846 --> 03:30:12,100
Now.

5007
03:30:12,100 --> 03:30:15,200
Let's talk about
the vertex-labeled graph. Now,

5008
03:30:15,200 --> 03:30:18,840
in a vertex-labeled graph,
each vertex is labeled

5009
03:30:18,840 --> 03:30:21,650
with some data
in addition to the data

5010
03:30:21,650 --> 03:30:23,700
that identifies the vertex.

5011
03:30:23,700 --> 03:30:28,100
So basically we call this X
or this V the vertex ID.

5012
03:30:28,200 --> 03:30:29,500
So there will be data

5013
03:30:29,500 --> 03:30:31,800
that would be added
to this vertex.

5014
03:30:32,000 --> 03:30:35,200
So let's say this vertex
would be 6 comma

5015
03:30:35,200 --> 03:30:37,500
and then we are adding the color

5016
03:30:37,500 --> 03:30:39,700
so it would be purple. Next,

5017
03:30:39,800 --> 03:30:42,100
this vertex would be 8 comma,

5018
03:30:42,100 --> 03:30:44,700
and the color
would be green. Next,

5019
03:30:44,700 --> 03:30:50,400
we'll write this as 7 comma
red, and then this one as

5020
03:30:50,400 --> 03:30:54,400
5 comma blue. Now,
this 6, or this 5,

5021
03:30:54,400 --> 03:30:55,639
or 7, or 8:

5022
03:30:55,639 --> 03:30:58,800
these are vertex IDs,
and the additional data

5023
03:30:58,800 --> 03:31:03,500
that is attached is the color,
like blue, purple, green, or red.

5024
03:31:03,900 --> 03:31:08,696
But only the identifying data
is present in the pair of edges

5025
03:31:08,696 --> 03:31:12,543
or you can say only the ID
of the vertex is present

5026
03:31:12,543 --> 03:31:13,773
in the edge set.

5027
03:31:14,100 --> 03:31:15,322
So here the edge set is

5028
03:31:15,322 --> 03:31:17,700
again similar to
your directed graph,

5029
03:31:17,700 --> 03:31:19,587
that is, your source ID,

5030
03:31:19,587 --> 03:31:21,992
which is 5, and
then the destination ID,

5031
03:31:21,992 --> 03:31:25,274
which is 6 in this case.
Then for this case,

5032
03:31:25,274 --> 03:31:28,785
it's similarly 5 comma
7; then for this case,

5033
03:31:28,785 --> 03:31:30,469
it's similarly 7 comma 8.

5034
03:31:30,469 --> 03:31:33,600
so we are not specifying
this additional data,

5035
03:31:33,600 --> 03:31:35,699
which is attached
to the vertices.

5036
03:31:35,699 --> 03:31:36,878
That is the color.

5037
03:31:36,878 --> 03:31:40,121
We only specify
the identifiers of the vertices,

5038
03:31:40,121 --> 03:31:41,300
that is, the numbers.

5039
03:31:41,300 --> 03:31:44,700
But your vertex set
would be something

5040
03:31:44,700 --> 03:31:46,300
like this: this vertex

5041
03:31:46,300 --> 03:31:50,100
would be 5 comma blue
then your next vertex

5042
03:31:50,100 --> 03:31:52,600
will become 6 comma purple

5043
03:31:53,100 --> 03:31:56,700
then your next vertex
will become 8 comma green

5044
03:31:57,000 --> 03:31:59,800
and at last your last
vertex will be written

5045
03:31:59,800 --> 03:32:01,100
as 7 comma red.

5046
03:32:01,100 --> 03:32:04,808
So basically when you
are specifying the vertices set

5047
03:32:04,808 --> 03:32:07,305
in the vertex-labeled
graph, you attach

5048
03:32:07,305 --> 03:32:10,683
the additional information
in the vertex set,

5049
03:32:10,683 --> 03:32:12,200
but while representing

5050
03:32:12,200 --> 03:32:16,183
the edge set, it is represented
the same way as in a directed graph,

5051
03:32:16,183 --> 03:32:19,900
where you have to just specify
the source vertex identifier

5052
03:32:19,900 --> 03:32:20,900
and then you have

5053
03:32:20,900 --> 03:32:24,300
to specify the destination
vertex identifier. Now,

5054
03:32:24,300 --> 03:32:27,500
I hope that you guys are clear
with undirected, directed,

5055
03:32:27,500 --> 03:32:29,000
and vertex-labeled graphs.
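
A vertex-labeled graph can be sketched in Python by keeping the labels in the vertex set and only the IDs in the edge set. The IDs and colors below are the ones from the talk; using a dictionary for the vertex set and the name `label_of` are assumptions made for this example.

```python
# Vertex set: each vertex ID carries an extra label (here, a color).
vertex_labels = {5: "blue", 6: "purple", 7: "red", 8: "green"}

# Edge set: only the identifying data (the IDs) appears, as (source, destination).
edges = {(5, 6), (5, 7), (7, 8)}


def label_of(vertex_id):
    """Look up the label attached to a vertex ID."""
    return vertex_labels[vertex_id]


# Resolve each edge to the labels of its endpoints, sorted for a stable order.
labelled_edges = sorted((label_of(src), label_of(dst)) for src, dst in edges)
print(labelled_edges)
```

The edge set never stores the colors; they are looked up through the vertex ID only when needed.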

5056
03:32:29,184 --> 03:32:33,615
So let's quickly move forward.
Next we have the cyclic graph.

5057
03:32:33,800 --> 03:32:36,800
So a cyclic graph
is a directed graph

5058
03:32:36,900 --> 03:32:38,900
with at least one cycle

5059
03:32:39,000 --> 03:32:43,153
and a cycle is a path
along the directed edges

5060
03:32:43,153 --> 03:32:44,933
from a vertex to itself.

5061
03:32:44,933 --> 03:32:47,000
So once you look over here,

5062
03:32:47,000 --> 03:32:47,708
you can see

5063
03:32:47,708 --> 03:32:50,541
that from this vertex
5, it's moving to vertex

5064
03:32:50,541 --> 03:32:51,700
7, then it's moving

5065
03:32:51,700 --> 03:32:54,700
to vertex 8, then with the arrows
moving to vertex 6.

5066
03:32:54,700 --> 03:32:57,539
And then again,
it's moving to vertex 5.

5067
03:32:57,539 --> 03:33:01,600
So there should be at least
one cycle in a cyclic graph.

5068
03:33:01,600 --> 03:33:04,000
There might be a new component.

5069
03:33:04,000 --> 03:33:08,400
say a vertex 9, which is
attached over here; again,

5070
03:33:08,400 --> 03:33:10,401
it would still be a cyclic graph

5071
03:33:10,401 --> 03:33:13,300
because it has
one complete cycle over here

5072
03:33:13,300 --> 03:33:15,500
and the important
thing to notice is

5073
03:33:15,500 --> 03:33:20,300
That the arrow should make
the cycle like from 5 to 7

5074
03:33:20,300 --> 03:33:23,300
and then from 7 to 8
and then 8 to 6

5075
03:33:23,300 --> 03:33:25,300
and 6 to 5 and let's say

5076
03:33:25,300 --> 03:33:26,831
that there is an arrow

5077
03:33:26,831 --> 03:33:30,281
from 5 to 6 and then there
is an arrow from 6 to 8.

5078
03:33:30,281 --> 03:33:32,233
So we have flipped the arrows.

5079
03:33:32,233 --> 03:33:33,600
So in that situation,

5080
03:33:33,600 --> 03:33:36,372
this is not a cyclic graph
because the arrows

5081
03:33:36,372 --> 03:33:38,200
are not completing the cycle.

5082
03:33:38,200 --> 03:33:41,370
So once you move from 5 to 7
and then from 7 to 8,

5083
03:33:41,370 --> 03:33:44,452
you cannot move from 8
to 6, and similarly,

5084
03:33:44,452 --> 03:33:47,167
once you move from 5 to 6
and then 6 to 8.

5085
03:33:47,167 --> 03:33:49,020
You cannot move from 8 to 7.

5086
03:33:49,020 --> 03:33:52,000
So in that situation,
it's not a cyclic graph.
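
Whether the arrows complete a cycle can be checked mechanically. The sketch below uses a depth-first search with three vertex states, which is one standard way to detect a directed cycle; it is not something shown in the talk, and the function and variable names are made up for the example.

```python
def has_cycle(vertices, edges):
    """Return True if the directed graph contains at least one cycle."""
    successors = {v: [] for v in vertices}
    for src, dst in edges:
        successors[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {v: WHITE for v in vertices}

    def visit(v):
        state[v] = GRAY
        for nxt in successors[v]:
            if state[nxt] == GRAY:  # we came back to a vertex on the current path
                return True
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[v] = BLACK
        return False

    return any(state[v] == WHITE and visit(v) for v in vertices)


# Arrows 5 -> 7 -> 8 -> 6 -> 5 complete a cycle.
cyclic = [(5, 7), (7, 8), (8, 6), (6, 5)]
# With two arrows flipped (5 -> 6 and 6 -> 8), no path returns to its start.
flipped = [(5, 7), (7, 8), (5, 6), (6, 8)]

print(has_cycle([5, 6, 7, 8], cyclic))
print(has_cycle([5, 6, 7, 8], flipped))
```

Attaching an extra vertex 9 to the cyclic example, for instance with an arrow from 8 to 9, still leaves the graph cyclic, because the original cycle is untouched.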

5087
03:33:52,000 --> 03:33:54,307
So let's clear all this thing.

5088
03:33:54,307 --> 03:33:56,461
So we'll represent this cycle

5089
03:33:56,461 --> 03:34:00,300
as 5, then using
double arrows we'll go to 7,

5090
03:34:00,300 --> 03:34:05,300
and then we'll move to 8,
and then we'll move to 6,

5091
03:34:05,300 --> 03:34:09,774
and at last we'll
come back to 5. Now,

5092
03:34:09,774 --> 03:34:11,851
we have the edge-labeled graph.

5093
03:34:12,000 --> 03:34:15,030
So basically an edge-labeled
graph is a graph where

5094
03:34:15,030 --> 03:34:17,752
the edges are
associated with labels.

5095
03:34:17,752 --> 03:34:22,059
So one can basically indicate
this by making the edge set

5096
03:34:22,059 --> 03:34:23,906
a set of triplets.

5097
03:34:23,906 --> 03:34:25,600
So for example,

5098
03:34:25,600 --> 03:34:26,900
let's say this edge

5099
03:34:26,900 --> 03:34:30,875
in this edge-labeled graph
will be denoted by the source,

5100
03:34:30,875 --> 03:34:33,200
which is 6 then the destination

5101
03:34:33,200 --> 03:34:38,000
which is 7 and then the label
of the edge which is blue.

5102
03:34:38,000 --> 03:34:41,400
So this Edge would
be defined something

5103
03:34:41,400 --> 03:34:44,700
like 6 comma 7 comma blue,
and then for this

5104
03:34:44,700 --> 03:34:47,100
one, similarly, the source vertex,

5105
03:34:47,100 --> 03:34:49,414
that is 7, the
destination vertex,

5106
03:34:49,414 --> 03:34:52,100
which is 8, then
the label of the edge,

5107
03:34:52,100 --> 03:34:55,400
which is white. And
similarly, for this edge

5108
03:34:55,400 --> 03:35:00,200
it's 5 comma 7 and
then blue comma red.

5109
03:35:01,000 --> 03:35:03,076
And at last, for this edge

5110
03:35:03,076 --> 03:35:09,200
it's 5 comma 6, and then it
would be yellow comma green,

5111
03:35:09,200 --> 03:35:11,362
which is the label of the edge.

5112
03:35:11,362 --> 03:35:14,665
So all these four edges
will become the edge set

5113
03:35:14,665 --> 03:35:18,400
for this graph and the vertices
set is almost similar

5114
03:35:18,400 --> 03:35:21,200
that is, 5 comma
6 comma 7 comma 8. Now,

5115
03:35:21,200 --> 03:35:24,200
to generalize this, I would say X

5116
03:35:24,200 --> 03:35:26,400
comma Y comma A. So X here is

5117
03:35:26,400 --> 03:35:30,700
the source vertex, then Y
here is the destination

5118
03:35:30,700 --> 03:35:33,914
vertex, and then A here is
the label of the edge.

5119
03:35:33,914 --> 03:35:36,900
Edge-labeled graphs
are usually drawn

5120
03:35:36,900 --> 03:35:39,573
with the labels written
adjacent to the arcs
5121
03:35:39,573 --> 03:35:40,902
specifying the edges

5122
03:35:40,902 --> 03:35:41,900
as you can see.

5123
03:35:41,900 --> 03:35:43,900
We have mentioned blue white

5124
03:35:43,900 --> 03:35:46,695
and all those label
addition to the edges.

5125
03:35:46,695 --> 03:35:50,400
So I hope you guys a player
with the edge label graph,

5126
03:35:50,400 --> 03:35:51,561
which is nothing

5127
03:35:51,561 --> 03:35:54,900
but labels attached
to each and every Edge now,

5128
03:35:54,900 --> 03:35:57,200
let's talk about weighted graph.

5129
03:35:57,200 --> 03:36:00,310
So we did graph is
an edge label draft.

5130
03:36:00,700 --> 03:36:03,700
Where the labels
can be operated on by

5131
03:36:03,700 --> 03:36:06,921
usually automatic operators
or comparison operators,

5132
03:36:06,921 --> 03:36:09,700
like less than or greater
than symbol usually

5133
03:36:09,700 --> 03:36:12,900
these are integers
or floats and the idea is

5134
03:36:12,900 --> 03:36:15,534
that some edges
may be more expensive

5135
03:36:15,534 --> 03:36:18,900
and this cost is represented
by the edge labels

5136
03:36:18,900 --> 03:36:22,992
or weights now in short weighted
graphs are a special kind

5137
03:36:22,992 --> 03:36:24,500
of Edgley build rafts

5138
03:36:24,500 --> 03:36:27,200
where your Edge
is attached to a weight.

5139
03:36:27,200 --> 03:36:29,800
Generally, which is
a integer or a float

5140
03:36:29,800 --> 03:36:33,100
so that we can perform
some addition or subtraction

5141
03:36:33,100 --> 03:36:35,452
or different kind
of automatic operations

5142
03:36:35,452 --> 03:36:36,689
or it can be some kind

5143
03:36:36,689 --> 03:36:39,500
of conditional operations
like less than or greater

5144
03:36:39,500 --> 03:36:40,800
than so we'll again

5145
03:36:40,800 --> 03:36:45,700
represent this Edge as 5 comma
6 and then the weight as 3

5146
03:36:46,100 --> 03:36:49,900
and similarly will represent
this Edge as 6 comma

5147
03:36:49,900 --> 03:36:55,351
7 and the weight is again
6 so similarly we represent

5148
03:36:55,351 --> 03:36:57,197
these two edges as well.

5149
03:36:57,300 --> 03:36:57,900
So I hope

5150
03:36:57,900 --> 03:37:00,500
that you guys are clear
with the weighted graphs.

5151
03:37:00,500 --> 03:37:02,300
Now let's quickly
move ahead and look

5152
03:37:02,300 --> 03:37:04,200
at this directed acyclic graph.

5153
03:37:04,200 --> 03:37:06,900
So this is
a directed acyclic graph,

5154
03:37:07,100 --> 03:37:09,500
which is basically
without Cycles.

5155
03:37:09,500 --> 03:37:12,445
So as we just discussed
in cyclic graphs here,

5156
03:37:12,445 --> 03:37:13,151
you can see

5157
03:37:13,151 --> 03:37:16,601
that it is not completing
the graph from the directions

5158
03:37:16,601 --> 03:37:19,607
or you can say the direction
of the edges, right?

5159
03:37:19,607 --> 03:37:21,011
We can move from 5 to 7,

5160
03:37:21,011 --> 03:37:22,164
then seven to eight

5161
03:37:22,164 --> 03:37:25,500
but we cannot move from 8 to 6
and similarly we can move

5162
03:37:25,500 --> 03:37:27,600
from 5:00 to 6:00
then 6:00 to 8:00,

5163
03:37:27,600 --> 03:37:29,700
but we cannot move from 8 to 7.

5164
03:37:29,700 --> 03:37:32,962
So this is Not forming
a cycle and these kind

5165
03:37:32,962 --> 03:37:36,300
of crafts are known as
directed acyclic graph.

5166
03:37:36,300 --> 03:37:39,914
Now, they appear as special
cases in CS application all

5167
03:37:39,914 --> 03:37:41,855
the time and the vertices set

5168
03:37:41,855 --> 03:37:44,600
and the edge set
are represented similarly

5169
03:37:44,700 --> 03:37:46,700
as we have seen
earlier not talking

5170
03:37:46,700 --> 03:37:48,670
about the disconnected graph.

5171
03:37:48,670 --> 03:37:51,972
So vertices in a graph
do not need to be connected

5172
03:37:51,972 --> 03:37:53,100
to other vertices.

5173
03:37:53,100 --> 03:37:54,466
It is basically legal

5174
03:37:54,466 --> 03:37:57,200
for a graph to have
disconnected components

5175
03:37:57,200 --> 03:38:00,466
or even loan vertices
without a single connection.

5176
03:38:00,466 --> 03:38:04,400
So basically this disconnected
graph which has four vertices

5177
03:38:04,400 --> 03:38:05,300
but no edges.

5178
03:38:05,300 --> 03:38:05,543
Now.

5179
03:38:05,543 --> 03:38:08,100
Let me tell you something
important that is

5180
03:38:08,100 --> 03:38:10,176
what our sources and sinks.

5181
03:38:10,200 --> 03:38:13,738
So let's say we have
one Arrow from five to six

5182
03:38:13,738 --> 03:38:18,233
and one Arrow from 5 to 7
now word is with only

5183
03:38:18,233 --> 03:38:20,233
in arrows are called sink.

5184
03:38:20,600 --> 03:38:25,200
So the 7 and 6 are known
as sinks and the vertices

5185
03:38:25,307 --> 03:38:28,400
with only out arrows
are called sources.

5186
03:38:28,400 --> 03:38:32,500
So as you can see in the image
this Five only have out arrows

5187
03:38:32,500 --> 03:38:33,800
to six and seven.

5188
03:38:33,800 --> 03:38:36,200
So these are called sources now.

5189
03:38:36,200 --> 03:38:38,506
We'll talk about this
in a while guys.

5190
03:38:38,506 --> 03:38:41,500
Once we are going
through the pagerank algorithm.

5191
03:38:41,500 --> 03:38:45,228
So I hope that you guys know
what our vertices what our edges

5192
03:38:45,228 --> 03:38:48,149
how vertices and edges
represents the graph then

5193
03:38:48,149 --> 03:38:50,200
what the different
kinds of graphs are.

5194
03:38:50,384 --> 03:38:52,615
Let's move to the next topic.

5195
03:38:52,800 --> 03:38:54,236
So next let's know.

5196
03:38:54,236 --> 03:38:55,900
What is Spark GraphX?

5197
03:38:55,900 --> 03:38:58,616
So talking about
GraphX: GraphX is

5198
03:38:58,616 --> 03:39:00,519
a new component in Spark

5199
03:39:00,519 --> 03:39:03,843
for graphs and graph-parallel
computation. Now,

5200
03:39:03,843 --> 03:39:06,170
at a high level, GraphX extends

5201
03:39:06,170 --> 03:39:09,954
the Spark RDD by introducing
a new graph abstraction,

5202
03:39:09,954 --> 03:39:12,046
that is, a directed multigraph

5203
03:39:12,046 --> 03:39:15,122
with properties
attached to each vertex

5204
03:39:15,122 --> 03:39:18,800
and edge. Now, to support
graph computation, GraphX

5205
03:39:18,800 --> 03:39:22,320
basically exposes a set
of fundamental operators,

5206
03:39:22,320 --> 03:39:25,400
like finding a subgraph,
joining vertices,

5207
03:39:25,400 --> 03:39:30,253
or aggregating messages, and
it also exposes an optimized

5208
03:39:30,253 --> 03:39:34,713
variant of the Pregel
API. In addition, GraphX also

5209
03:39:34,713 --> 03:39:37,987
provides you a collection
of graph algorithms

5210
03:39:37,987 --> 03:39:41,700
and builders to simplify
your Spark analytics tasks.

5211
03:39:41,700 --> 03:39:45,600
So basically, GraphX
is extending your Spark RDD.

5212
03:39:45,600 --> 03:39:48,800
Then GraphX
is providing an abstraction,

5213
03:39:48,800 --> 03:39:50,614
that is, a directed multigraph

5214
03:39:50,614 --> 03:39:53,800
with properties attached
to each vertex and Edge.

5215
03:39:53,800 --> 03:39:56,800
So we'll look at this
property graph in a while.

5216
03:39:56,800 --> 03:40:00,200
Then again, GraphX gives you
some fundamental operators

5217
03:40:00,200 --> 03:40:01,000
and then it also

5218
03:40:01,000 --> 03:40:03,800
provides you some graph
algorithms and Builders

5219
03:40:03,800 --> 03:40:07,260
which make your analytics
easier. Now, to get started,

5220
03:40:07,260 --> 03:40:11,400
you first need to import Spark
and GraphX into your project.

5221
03:40:11,400 --> 03:40:12,550
So as you can see,

5222
03:40:12,550 --> 03:40:15,875
we are first importing Spark,
and then we are importing

5223
03:40:15,875 --> 03:40:19,200
Spark GraphX to get
those GraphX functionalities.

5224
03:40:19,200 --> 03:40:21,150
And at last we are importing

5225
03:40:21,150 --> 03:40:25,400
Spark RDD to use those RDD
functionalities in our program.

5226
03:40:25,400 --> 03:40:28,098
But let me tell you
that if you are not using

5227
03:40:28,098 --> 03:40:30,400
the Spark shell, then you
will need a

5228
03:40:30,400 --> 03:40:31,807
SparkContext in your program.
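The imports and the SparkContext mentioned above could look roughly like this. It is a minimal sketch, assuming a standalone program rather than the Spark shell; the object name, app name, and local master setting are placeholder choices, not taken from the session.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._   // Graph, Edge, VertexId and related types
import org.apache.spark.rdd.RDD    // needed when RDD types are written explicitly

object GraphXSetup {
  def main(args: Array[String]): Unit = {
    // Outside spark-shell there is no predefined `sc`, so we build one.
    // "graphx-demo" and "local[*]" are assumed values for a local test run.
    val conf = new SparkConf().setAppName("graphx-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Small sanity check that the context works.
    val numbers: RDD[Int] = sc.parallelize(1 to 3)
    println(numbers.count())

    sc.stop()
  }
}
```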

5229
03:40:31,807 --> 03:40:32,341
So I hope

5230
03:40:32,341 --> 03:40:35,400
that you guys are clear
with the features of GraphX

5231
03:40:35,400 --> 03:40:36,400
and the libraries

5232
03:40:36,400 --> 03:40:39,200
which you need to import
in order to use GraphX.

5233
03:40:39,300 --> 03:40:43,500
So let us quickly move ahead
and look at the property graph.

5234
03:40:43,500 --> 03:40:45,800
Now property graph is something

5235
03:40:45,800 --> 03:40:50,300
as the name suggests: a property
graph has properties attached

5236
03:40:50,300 --> 03:40:52,400
to each vertex and Edge.

5237
03:40:52,500 --> 03:40:54,115
So the property graph

5238
03:40:54,115 --> 03:40:58,653
is a directed multigraph with
user-defined objects attached

5239
03:40:58,653 --> 03:41:00,500
to each vertex and Edge.

5240
03:41:00,500 --> 03:41:03,700
Now you might be wondering
what a directed multigraph is.

5241
03:41:03,700 --> 03:41:08,123
So a directed multigraph is a
directed graph with potentially

5242
03:41:08,123 --> 03:41:11,137
multiple parallel edges
sharing the same source

5243
03:41:11,137 --> 03:41:13,050
and the same destination vertex.

5244
03:41:13,050 --> 03:41:15,102
So as you can see in the image

5245
03:41:15,102 --> 03:41:17,700
that from San Francisco
to Los Angeles,

5246
03:41:17,700 --> 03:41:22,106
we have two edges and similarly
from Los Angeles to Chicago.

5247
03:41:22,106 --> 03:41:23,600
There are two edges.

5248
03:41:23,600 --> 03:41:26,019
So basically in
a directed multigraph,

5249
03:41:26,019 --> 03:41:28,400
the first thing is
the directed graph,

5250
03:41:28,400 --> 03:41:30,386
so it should have a direction

5251
03:41:30,386 --> 03:41:33,300
attached to the edges,
and then talking

5252
03:41:33,300 --> 03:41:36,100
about multigraph: so
between a source vertex

5253
03:41:36,100 --> 03:41:37,850
and a destination vertex,

5254
03:41:37,850 --> 03:41:39,600
there could be two edges.

5255
03:41:39,800 --> 03:41:42,886
So the ability to
support parallel edges

5256
03:41:42,886 --> 03:41:46,100
basically simplifies
the modeling scenarios

5257
03:41:46,100 --> 03:41:49,054
where there can be
multiple relationships

5258
03:41:49,054 --> 03:41:51,997
between the same vertices.
For example,

5259
03:41:51,997 --> 03:41:54,200
Let's say these are two persons

5260
03:41:54,200 --> 03:41:56,644
so they can be friends
as well as they

5261
03:41:56,644 --> 03:41:58,361
can be co-workers, right?

5262
03:41:58,361 --> 03:42:02,000
So these kinds of scenarios
can be easily modeled using

5263
03:42:02,000 --> 03:42:03,900
a directed multigraph. Now,

5264
03:42:03,900 --> 03:42:08,700
Each vertex is keyed by
a unique 64-bit long identifier,

5265
03:42:08,800 --> 03:42:12,700
which is basically the vertex ID,
and it helps in indexing.

5266
03:42:12,700 --> 03:42:16,500
So each of your vertices
contains a vertex ID,

5267
03:42:16,600 --> 03:42:20,000
which is a unique
64-bit long identifier

5268
03:42:20,200 --> 03:42:21,900
and similarly edges

5269
03:42:21,900 --> 03:42:26,600
have corresponding source and
destination vertex identifiers.

5270
03:42:26,700 --> 03:42:28,174
So this Edge would have

5271
03:42:28,174 --> 03:42:31,647
this vertex identifier as
well as this vertex identifier,

5272
03:42:31,647 --> 03:42:35,620
or you can say the source vertex ID
and the destination vertex ID.

5273
03:42:35,620 --> 03:42:37,900
So as we discussed,
this property graph

5274
03:42:37,900 --> 03:42:42,300
is basically parameterised
over the vertex and Edge types,

5275
03:42:42,300 --> 03:42:45,684
and these are the types
of objects associated

5276
03:42:45,684 --> 03:42:47,700
with each vertex and Edge.

5277
03:42:48,400 --> 03:42:51,792
So GraphX basically
optimizes the representation

5278
03:42:51,792 --> 03:42:53,300
of vertex and Edge types

5279
03:42:53,300 --> 03:42:56,900
and it reduces the in-memory
footprint by storing

5280
03:42:56,900 --> 03:43:00,400
the primitive data types
in a specialized array.

5281
03:43:00,400 --> 03:43:04,400
In some cases it might be
desirable to have vertices

5282
03:43:04,400 --> 03:43:07,200
with different property types
in the same graph.

5283
03:43:07,200 --> 03:43:10,400
Now this can be accomplished
through inheritance.

5284
03:43:10,400 --> 03:43:14,000
So for example, to model
users and products

5285
03:43:14,000 --> 03:43:15,300
in a bipartite graph,

5286
03:43:15,300 --> 03:43:17,676
you can see
that we have user property

5287
03:43:17,676 --> 03:43:19,400
and we have product property.

5288
03:43:19,400 --> 03:43:19,762
Okay.

5289
03:43:19,762 --> 03:43:23,400
So let me first tell you
what is a bipartite graph.

5290
03:43:23,400 --> 03:43:26,861
So a bipartite graph,
also called a bigraph,

5291
03:43:27,000 --> 03:43:29,500
is a set
of graph vertices

5292
03:43:30,300 --> 03:43:35,400
decomposed into two disjoint sets
such that no two graph vertices

5293
03:43:35,469 --> 03:43:37,930
within the same
set are adjacent.

5294
03:43:38,100 --> 03:43:39,700
So as you can see over here,

5295
03:43:39,700 --> 03:43:43,000
we have user property and then
we have product property

5296
03:43:43,000 --> 03:43:46,282
but no two user vertices
can be adjacent, or you

5297
03:43:46,282 --> 03:43:48,592
can say there should be no edges

5298
03:43:48,592 --> 03:43:51,707
that is joining any
two user vertices, or

5299
03:43:51,707 --> 03:43:53,300
there should be no Edge

5300
03:43:53,300 --> 03:43:56,000
that is joining
two product vertices.

5301
03:43:56,400 --> 03:44:00,000
So in this scenario
we use inheritance.

5302
03:44:00,200 --> 03:44:01,757
So as you can see here,

5303
03:44:01,757 --> 03:44:04,600
we have the class VertexProperty.
Now basically,

5304
03:44:04,600 --> 03:44:07,400
what we are doing is
creating another class,

5305
03:44:07,400 --> 03:44:08,900
UserProperty.

5306
03:44:08,900 --> 03:44:10,700
And here we have name,

5307
03:44:10,700 --> 03:44:13,500
which is again a string
and we are extending

5308
03:44:13,500 --> 03:44:17,038
or you can say we are inheriting
the VertexProperty class.

5309
03:44:17,038 --> 03:44:19,600
Now again, in the case
of ProductProperty,

5310
03:44:19,600 --> 03:44:22,100
we have name, that is, the
name of the product,

5311
03:44:22,100 --> 03:44:25,000
which is again a String, and then
we have the price of the product,

5312
03:44:25,000 --> 03:44:25,985
which is a Double,

5313
03:44:25,985 --> 03:44:29,400
and we are again extending
this VertexProperty class,

5314
03:44:29,400 --> 03:44:32,900
and at last we are creating a
graph with this VertexProperty

5315
03:44:32,900 --> 03:44:33,900
and then String.

5316
03:44:33,900 --> 03:44:37,045
So this is how we
can basically model user

5317
03:44:37,045 --> 03:44:39,500
and product as
a bipartite graph.

5318
03:44:39,500 --> 03:44:41,430
So we have created user property

5319
03:44:41,430 --> 03:44:44,265
as well as we have created
this product property

5320
03:44:44,265 --> 03:44:47,100
and we are extending
this VertexProperty class.
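The inheritance approach described above can be sketched as follows. The class and field names follow what is spoken in the session (VertexProperty, UserProperty with a name, ProductProperty with a name and price); the enclosing object and the type alias are assumptions added so the snippet compiles on its own.

```scala
import org.apache.spark.graphx.Graph

object BipartiteTypes {
  // Common base class for every vertex property in the graph.
  class VertexProperty()

  // User vertices carry a name.
  case class UserProperty(name: String) extends VertexProperty

  // Product vertices carry a name and a price.
  case class ProductProperty(name: String, price: Double) extends VertexProperty

  // A graph whose vertices are VertexProperty (users or products)
  // and whose edges are labelled with a String.
  type UserProductGraph = Graph[VertexProperty, String]
}
```

Because both vertex kinds share one base type, a single graph can hold users and products at the same time.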

5321
03:44:47,400 --> 03:44:50,076
Now, talking about
this property graph:

5322
03:44:50,076 --> 03:44:51,907
It's similar to your RDD.

5323
03:44:51,907 --> 03:44:55,900
So like your RDD, property graphs
are immutable, distributed,

5324
03:44:55,900 --> 03:44:57,200
and fault-tolerant.

5325
03:44:57,200 --> 03:45:00,491
So changes to the values
or structure of the graph

5326
03:45:00,491 --> 03:45:01,908
are basically accomplished

5327
03:45:01,908 --> 03:45:04,900
by producing a new graph
with the desired changes

5328
03:45:04,900 --> 03:45:07,700
and substantial parts
of the original graph,

5329
03:45:07,700 --> 03:45:09,900
which can be your structure
of the graph

5330
03:45:09,900 --> 03:45:11,800
or attributes or indices.

5331
03:45:11,800 --> 03:45:15,081
these are basically reused
in the new graph, reducing

5332
03:45:15,081 --> 03:45:18,040
the cost of this inherently
functional data structure.

5333
03:45:18,040 --> 03:45:20,100
So basically your property graph

5334
03:45:20,100 --> 03:45:22,500
when you try to change
values or structure,

5335
03:45:22,500 --> 03:45:26,024
So it creates a new graph
with changed structure

5336
03:45:26,024 --> 03:45:27,300
or changed values

5337
03:45:27,300 --> 03:45:30,182
and a substantial part
of the original graph

5338
03:45:30,182 --> 03:45:33,300
is reused multiple times
to improve the performance,

5339
03:45:33,300 --> 03:45:35,900
and it can be
your structure of the graph

5340
03:45:35,900 --> 03:45:38,600
which is getting reused,
or it can be your attributes

5341
03:45:38,600 --> 03:45:41,000
or indices of the graph
which is getting reused.

5342
03:45:41,000 --> 03:45:44,400
So this is how your property
graph provides efficiency.
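The immutability described above can be illustrated with a small sketch: a transformation such as mapVertices returns a new graph and leaves the original untouched. The object name, the two sample vertices, and the local master setting are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object ImmutableGraphDemo {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; in spark-shell `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("immutable-graph").setMaster("local[*]"))

    val vertices = sc.parallelize(Seq((1L, "thomas"), (2L, "frank")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "colleague")))
    val graph: Graph[String, String] = Graph(vertices, edges)

    // mapVertices builds a NEW graph with changed vertex values;
    // `graph` keeps its original attributes.
    val renamed: Graph[String, String] =
      graph.mapVertices((id, name) => name.toUpperCase)

    graph.vertices.collect().foreach(println)   // original, lower-case names
    renamed.vertices.collect().foreach(println) // new graph, upper-case names

    sc.stop()
  }
}
```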

5343
03:45:44,400 --> 03:45:46,400
Now, the graph is partitioned

5344
03:45:46,400 --> 03:45:48,800
across the executors
using a range

5345
03:45:48,800 --> 03:45:50,500
of vertex partitioning rules,

5346
03:45:50,500 --> 03:45:52,700
which are basically
loosely defined,

5347
03:45:52,700 --> 03:45:56,514
and similar to RDDs,
each partition of the graph

5348
03:45:56,514 --> 03:45:57,800
can be recreated

5349
03:45:57,800 --> 03:46:01,100
on different machines
in the event of Failure.

5350
03:46:01,100 --> 03:46:05,000
So this is how your property
graph provides fault tolerance.

5351
03:46:05,000 --> 03:46:07,643
So as we already
discussed logically

5352
03:46:07,643 --> 03:46:12,174
the property graph corresponds
to a pair of typed collections,

5353
03:46:12,174 --> 03:46:15,800
encoding the properties
for each vertex and edge,

5354
03:46:15,800 --> 03:46:17,338
and as a consequence

5355
03:46:17,338 --> 03:46:21,492
the graph class contains
members to access the vertices

5356
03:46:21,492 --> 03:46:22,569
and the edges.

5357
03:46:22,800 --> 03:46:24,067
So as you can see we

5358
03:46:24,067 --> 03:46:27,300
have the Graph class; then you
can see we have vertices

5359
03:46:27,307 --> 03:46:28,692
and we have edges.

5360
03:46:29,500 --> 03:46:34,400
Now this VertexRDD[VD]
is extending your RDD,

5361
03:46:34,600 --> 03:46:41,100
which is your RDD
of your VertexId

5362
03:46:41,500 --> 03:46:43,807
and then your vertex property.

5363
03:46:44,600 --> 03:46:45,100
Similarly.

5364
03:46:45,100 --> 03:46:47,600
Your EdgeRDD is extending

5365
03:46:47,600 --> 03:46:53,500
your RDD with your edge
property. So the classes,

5366
03:46:53,500 --> 03:46:54,900
that is, VertexRDD

5367
03:46:54,900 --> 03:47:00,100
and EdgeRDD, extend and are
optimized versions of your RDD,

5368
03:47:00,100 --> 03:47:03,810
which includes VertexId and
vertex property, and your RDD

5369
03:47:03,810 --> 03:47:06,746
which includes your edge
property. And both

5370
03:47:06,746 --> 03:47:07,795
this VertexRDD

5371
03:47:07,795 --> 03:47:11,501
and EdgeRDD provide additional
functionality built on top

5372
03:47:11,501 --> 03:47:12,876
of graph computation

5373
03:47:12,876 --> 03:47:15,900
and leverage internal
optimizations as well.

5374
03:47:15,900 --> 03:47:19,159
So this is the reason we use
this VertexRDD or EdgeRDD,

5375
03:47:19,159 --> 03:47:22,500
because it already extends your
RDD containing your

5376
03:47:22,500 --> 03:47:23,888
vertex ID and vertex property

5377
03:47:23,888 --> 03:47:26,700
or your edge property.
It also provides you

5378
03:47:26,700 --> 03:47:30,100
additional functionalities built
on top of graph computation.

5379
03:47:30,100 --> 03:47:33,700
And again, it gives you some
internal optimizations as well.

5380
03:47:34,100 --> 03:47:37,715
Now, let me clear
this and let's take an example

5381
03:47:37,715 --> 03:47:39,000
of property graph

5382
03:47:39,000 --> 03:47:40,633
where the vertex property

5383
03:47:40,633 --> 03:47:43,300
might contain the username
and occupation.

5384
03:47:43,300 --> 03:47:47,200
So as you can see in this table
that we have ID of the vertex

5385
03:47:47,200 --> 03:47:50,000
and then we have property
attached to each vertex.

5386
03:47:50,000 --> 03:47:52,602
That is the username
as well as the occupation

5387
03:47:52,602 --> 03:47:55,700
of the user, or you can say
the profession of the user,

5388
03:47:55,700 --> 03:47:58,715
and we can annotate
the edges with a string

5389
03:47:58,715 --> 03:48:01,800
describing the relationship
between the users.

5390
03:48:01,800 --> 03:48:04,400
So as you can see,
first is Thomas,

5391
03:48:04,400 --> 03:48:06,300
who is a professor
then second is Frank

5392
03:48:06,300 --> 03:48:08,000
who is also a professor then

5393
03:48:08,000 --> 03:48:09,900
as you can see third is Jenny.

5394
03:48:09,900 --> 03:48:12,241
She's a student, and fourth is Bob,

5395
03:48:12,241 --> 03:48:15,997
who is a doctor now Thomas is
a colleague of Frank.

5396
03:48:15,997 --> 03:48:17,200
Then you can see

5397
03:48:17,200 --> 03:48:21,000
that Thomas is the academic
advisor of Jenny. Again,

5398
03:48:21,000 --> 03:48:23,153
Frank is also an academic advisor

5399
03:48:23,153 --> 03:48:27,692
of Jenny and then the doctor
is the health advisor of Jenny.

5400
03:48:27,700 --> 03:48:31,200
So the resulting graph
would have a signature

5401
03:48:31,200 --> 03:48:32,800
of something like this.

5402
03:48:32,800 --> 03:48:34,800
So I'll explain this in a while.

5403
03:48:34,900 --> 03:48:38,300
So there are numerous ways
to construct the property graph

5404
03:48:38,300 --> 03:48:39,300
from raw files

5405
03:48:39,300 --> 03:48:43,400
or RDDs or even synthetic
generators, and we'll discuss it

5406
03:48:43,400 --> 03:48:44,766
in graph Builders,

5407
03:48:44,766 --> 03:48:46,313
but probably the

5408
03:48:46,313 --> 03:48:49,700
most general method
is to use the Graph object.

5409
03:48:49,700 --> 03:48:52,129
So let's take a look
at the code first.

5410
03:48:52,129 --> 03:48:53,651
And so first over here,

5411
03:48:53,651 --> 03:48:55,900
we are assuming
that a SparkContext

5412
03:48:55,900 --> 03:48:58,100
has already been constructed.

5413
03:48:58,100 --> 03:49:01,700
Then we are declaring
sc as the SparkContext. Next,

5414
03:49:01,700 --> 03:49:04,600
we are creating an RDD
for the vertices.

5415
03:49:04,600 --> 03:49:06,689
So as you can see for users,

5416
03:49:06,689 --> 03:49:09,600
we have specified RDD
and then VertexId,

5417
03:49:09,600 --> 03:49:11,393
and then these are two strings.

5418
03:49:11,393 --> 03:49:12,605
So first one would be

5419
03:49:12,605 --> 03:49:15,900
your username and the second one
will be your profession.

5420
03:49:15,900 --> 03:49:19,612
Then we are using sc.parallelize,
and we are creating an array

5421
03:49:19,612 --> 03:49:22,300
where we are specifying
all the vertices so

5422
03:49:22,300 --> 03:49:23,838
And that is this one

5423
03:49:23,900 --> 03:49:25,900
and you are getting
the name as Thomas

5424
03:49:25,900 --> 03:49:26,800
and the profession

5425
03:49:26,800 --> 03:49:30,646
is professor. Similarly,
for 2L: Frank, professor.

5426
03:49:30,646 --> 03:49:34,600
Then 3L: Jenny, student,
and 4L: Bob, doctor.

5427
03:49:34,600 --> 03:49:37,746
So here we have created
the vertices. Next,

5428
03:49:37,746 --> 03:49:40,207
we are creating
an RDD for edges.

5429
03:49:40,500 --> 03:49:43,400
So first we are giving
the val relationships.

5430
03:49:43,400 --> 03:49:46,400
Then we are creating
an RDD of Edge[String],

5431
03:49:46,400 --> 03:49:50,000
and then we're using
sc.parallelize to create the edges,

5432
03:49:50,000 --> 03:49:52,948
and in the array we are
specifying the source vertex,

5433
03:49:52,948 --> 03:49:55,595
then we are specifying
the destination vertex.

5434
03:49:55,595 --> 03:49:57,400
And then we are
giving the relation

5435
03:49:57,400 --> 03:50:01,000
that is, colleague. Similarly,
for the next edge, the source

5436
03:50:01,000 --> 03:50:02,800
and the destination are given,

5437
03:50:02,800 --> 03:50:06,131
and then the relation
is academic advisor,

5438
03:50:06,165 --> 03:50:07,934
and then it goes so on.

5439
03:50:08,242 --> 03:50:11,857
So then, in this line, we
are defining a default user

5440
03:50:12,200 --> 03:50:16,276
in case there is a relationship
between missing users.

5441
03:50:16,300 --> 03:50:18,900
Now we have given
the name as default user

5442
03:50:18,900 --> 03:50:20,800
and the profession is missing.

5443
03:50:21,400 --> 03:50:24,000
Next, we are trying to build
an initial graph.

5444
03:50:24,000 --> 03:50:27,100
So for that we are using
this Graph object.

5445
03:50:27,100 --> 03:50:30,100
So we have specified users
that is your vertices.

5446
03:50:30,100 --> 03:50:34,300
Then we are specifying the
relations that is your edges.

5447
03:50:34,400 --> 03:50:36,867
And then we are giving
the default user

5448
03:50:36,867 --> 03:50:39,400
which is basically
for any missing user.

5449
03:50:39,400 --> 03:50:41,800
So now as you can see over here,

5450
03:50:41,800 --> 03:50:46,700
we are using the Edge case class,
and edges have a source ID

5451
03:50:46,700 --> 03:50:48,300
and a destination ID,

5452
03:50:48,300 --> 03:50:51,300
which is basically
corresponding to your source

5453
03:50:51,300 --> 03:50:52,800
and destination vertex.

5454
03:50:52,800 --> 03:50:55,100
And in addition,
the Edge class

5455
03:50:55,100 --> 03:50:56,900
has an attr member,

5456
03:50:56,900 --> 03:51:00,600
which stores the edge property,
which is the relation over here,

5457
03:51:00,600 --> 03:51:01,600
that is colleague

5458
03:51:01,600 --> 03:51:06,138
or it is academic advisor, or it
is health advisor, and so on.
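Putting the steps above together, a property graph for this example might be built as sketched below. The names users, relationships, and defaultUser follow the narration; the edge directions (for instance 1L to 2L for "colleague") and the local SparkContext setup are assumptions, since the exact values on the slide are not recoverable from the transcript.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object BuildPropertyGraph {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; in spark-shell `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("property-graph").setMaster("local[*]"))

    // Vertices: (vertex ID, (username, profession))
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (1L, ("Thomas", "professor")),
      (2L, ("Frank", "professor")),
      (3L, ("Jenny", "student")),
      (4L, ("Bob", "doctor"))))

    // Edges: Edge(source ID, destination ID, relationship).
    // Directions here are an assumption made for illustration.
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "colleague"),
      Edge(1L, 3L, "academic advisor"),
      Edge(2L, 3L, "academic advisor"),
      Edge(4L, 3L, "health advisor")))

    // Used for any vertex referenced by an edge but missing from `users`.
    val defaultUser = ("default user", "missing")

    val graph: Graph[(String, String), String] =
      Graph(users, relationships, defaultUser)

    println(s"vertices = ${graph.vertices.count()}, edges = ${graph.edges.count()}")
    sc.stop()
  }
}
```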

5459
03:51:06,200 --> 03:51:06,900
So, I hope

5460
03:51:06,900 --> 03:51:10,287
that you guys are clear
about creating a property graph

5461
03:51:10,287 --> 03:51:13,800
how to specify the vertices
how to specify edges and then

5462
03:51:13,800 --> 03:51:17,763
how to create a graph. Now,
we can deconstruct a graph

5463
03:51:17,763 --> 03:51:19,461
into respective vertex

5464
03:51:19,461 --> 03:51:23,000
and edge views by using
the graph.vertices

5465
03:51:23,000 --> 03:51:24,900
and graph.edges members.

5466
03:51:25,000 --> 03:51:27,041
So as you can see,
we are using graph.vertices

5467
03:51:27,041 --> 03:51:30,100
over here
and graph.edges over here.

5468
03:51:30,100 --> 03:51:32,100
Now what we are trying to do.

5469
03:51:32,100 --> 03:51:35,900
So first over here the graph
which we have created earlier.

5470
03:51:35,900 --> 03:51:37,291
So we have

5471
03:51:37,300 --> 03:51:40,700
graph.vertices.filter. Now,
using this case expression,

5472
03:51:40,700 --> 03:51:42,300
We have this vertex ID.

5473
03:51:42,300 --> 03:51:45,378
We have the name and then
we have the position.

5474
03:51:45,378 --> 03:51:48,322
And we are specifying
the position as doctor.

5475
03:51:48,322 --> 03:51:51,400
So first we are trying
to filter the profession

5476
03:51:51,400 --> 03:51:53,600
of the user as doctor.

5477
03:51:53,600 --> 03:51:55,400
And then we are trying to

5478
03:51:55,400 --> 03:51:55,630
count it.

5479
03:51:55,900 --> 03:51:56,900
Next.

5480
03:51:56,900 --> 03:51:59,700
We are specifying
graph.edges.filter,

5481
03:51:59,900 --> 03:52:03,270
and we are basically
trying to filter the edges

5482
03:52:03,270 --> 03:52:07,300
where the source ID is greater
than your destination ID.

5483
03:52:07,300 --> 03:52:09,800
And then we are trying
to count those edges.

5484
03:52:09,800 --> 03:52:12,600
We are using
a Scala case expression

5485
03:52:12,600 --> 03:52:15,400
as you can see, to
deconstruct the tuple.

5486
03:52:15,500 --> 03:52:17,400
You can say to deconstruct

5487
03:52:17,400 --> 03:52:23,358
the result. On the other hand,
graph.edges returns an EdgeRDD,

5488
03:52:23,358 --> 03:52:26,282
which is containing
Edge[String] objects.

5489
03:52:26,400 --> 03:52:30,800
So we could also have used
the case class type constructor,

5490
03:52:30,900 --> 03:52:32,200
as you can see here.

5491
03:52:32,200 --> 03:52:34,832
So again, over here we
are using graph.edges.filter,

5492
03:52:34,832 --> 03:52:36,400
and over here

5493
03:52:36,400 --> 03:52:40,400
we have given case Edge, and then
we are specifying the properties,

5494
03:52:40,400 --> 03:52:43,900
that is, source, destination,
and then the property of the edge

5495
03:52:43,900 --> 03:52:45,000
which is attached.

5496
03:52:45,000 --> 03:52:48,800
And then we are filtering it and
then we are trying to count it.

5497
03:52:48,800 --> 03:52:53,547
So this is how, using the Edge class,
either you can say with edges

5498
03:52:53,547 --> 03:52:55,603
or you can say with vertices,

5499
03:52:55,603 --> 03:52:59,191
This is how you can go ahead
and deconstruct them.

5500
03:52:59,191 --> 03:53:01,900
Right? Because your
graph.vertices

5501
03:53:01,900 --> 03:53:06,300
or your graph.edges returns
a VertexRDD or EdgeRDD.

5502
03:53:06,400 --> 03:53:07,947
So to deconstruct them,

5503
03:53:07,947 --> 03:53:10,100
we basically use
this case class.

5504
03:53:10,100 --> 03:53:11,000
So I hope you

5505
03:53:11,000 --> 03:53:13,742
guys are clear about
transforming property graph.

5506
03:53:13,742 --> 03:53:15,400
and how you use this case

5507
03:53:15,400 --> 03:53:19,300
class to deconstruct
the VertexRDD or EdgeRDD.
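The vertex and edge views described above can be filtered as in the following sketch. The sample graph repeats the users from the session; the edge directions and the local SparkContext setup are assumptions, so the counts noted in the comments hold only for this sample data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object FilterViews {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; in spark-shell `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("filter-views").setMaster("local[*]"))

    val users = sc.parallelize(Seq(
      (1L, ("Thomas", "professor")), (2L, ("Frank", "professor")),
      (3L, ("Jenny", "student")), (4L, ("Bob", "doctor"))))
    val relationships = sc.parallelize(Seq(
      Edge(1L, 2L, "colleague"), Edge(1L, 3L, "academic advisor"),
      Edge(2L, 3L, "academic advisor"), Edge(4L, 3L, "health advisor")))
    val graph = Graph(users, relationships, ("default user", "missing"))

    // Vertex view: a case pattern deconstructs (id, (name, position)).
    val doctors = graph.vertices
      .filter { case (id, (name, pos)) => pos == "doctor" }
      .count()

    // Edge view: keep edges whose source ID is greater than the destination ID.
    val srcGreater = graph.edges.filter(e => e.srcId > e.dstId).count()

    // Same filter, using the Edge case class constructor as the pattern.
    val srcGreater2 = graph.edges
      .filter { case Edge(src, dst, prop) => src > dst }
      .count()

    // With the sample data above: 1 doctor (Bob), and 1 such edge (4 -> 3).
    println(s"doctors = $doctors, src > dst = $srcGreater / $srcGreater2")
    sc.stop()
  }
}
```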

5508
03:53:20,169 --> 03:53:22,630
So now let's quickly move ahead.

5509
03:53:22,700 --> 03:53:24,875
Now in addition to the vertex

5510
03:53:24,875 --> 03:53:27,406
and edge views
of the property graph,

5511
03:53:27,406 --> 03:53:30,300
GraphX also exposes
a triplet view. Now,

5512
03:53:30,300 --> 03:53:32,700
you might be wondering
what is a triplet view.

5513
03:53:32,700 --> 03:53:35,977
So the triplet view
logically joins the vertex

5514
03:53:35,977 --> 03:53:39,600
and edge properties,
yielding an RDD of EdgeTriplet

5515
03:53:39,600 --> 03:53:42,700
with vertex property
and your Edge property.

5516
03:53:42,700 --> 03:53:45,174
So as you can see,
it gives an RDD

5517
03:53:45,174 --> 03:53:47,217
with EdgeTriplet, and then it

5518
03:53:47,217 --> 03:53:51,523
has vertex property as well as
edge property associated with it,

5519
03:53:51,523 --> 03:53:55,100
and it contains instances
of the EdgeTriplet class.

5520
03:53:55,200 --> 03:53:55,700
Now.

5521
03:53:55,700 --> 03:53:57,800
I am taking example of a join.

5522
03:53:57,800 --> 03:54:01,603
So in this join we are trying
to select source ID, destination

5523
03:54:01,603 --> 03:54:03,100
ID, source attribute, then

5524
03:54:03,100 --> 03:54:04,635
this is your Edge attribute

5525
03:54:04,635 --> 03:54:07,400
and then at last you
have destination attribute.

5526
03:54:07,400 --> 03:54:11,200
So basically your edges have
alias e, then your vertices

5527
03:54:11,200 --> 03:54:12,907
have the alias source.

5528
03:54:12,907 --> 03:54:16,516
And again your vertices
have the alias destination. So we

5529
03:54:16,516 --> 03:54:19,900
are trying to select
Source ID destination ID,

5530
03:54:19,900 --> 03:54:23,155
then Source, attribute
and destination attribute,

5531
03:54:23,155 --> 03:54:25,800
and we are also selecting
the edge attribute,

5532
03:54:25,800 --> 03:54:28,200
and we are performing left join.

5533
03:54:28,400 --> 03:54:31,900
The edge Source ID should
be equal to Source ID

5534
03:54:31,900 --> 03:54:35,600
and the edge destination ID should
be equal to destination ID.

5535
03:54:36,400 --> 03:54:39,700
And now your
EdgeTriplet class basically

5536
03:54:39,700 --> 03:54:43,090
extends your Edge class
by adding your srcAttr

5537
03:54:43,090 --> 03:54:45,100
and dstAttr
members,

5538
03:54:45,100 --> 03:54:48,100
which contain the source
and destination properties,

5539
03:54:48,200 --> 03:54:49,155
and we can use

5540
03:54:49,155 --> 03:54:52,500
the triplet view of a graph
to render a collection

5541
03:54:52,500 --> 03:54:55,804
of strings describing
relationships between users.

5542
03:54:55,804 --> 03:54:59,521
This is vertex 1 which is again
denoting your user one.

5543
03:54:59,521 --> 03:55:01,986
That is Thomas,
who is a professor,

5544
03:55:01,986 --> 03:55:03,081
and this is vertex 3,

5545
03:55:03,081 --> 03:55:06,400
which is denoting Jenny,
and she's a student.

5546
03:55:06,400 --> 03:55:07,994
And this is your Edge,

5547
03:55:07,994 --> 03:55:11,400
which is defining
the relationship between them.

5548
03:55:11,400 --> 03:55:13,600
So this is an edge triplet,

5549
03:55:13,600 --> 03:55:17,300
which is denoting
both vertices as well

5550
03:55:17,300 --> 03:55:20,900
as the edge which denotes
the relation between them.

5551
03:55:20,900 --> 03:55:23,600
So now looking at this code
first we have already

5552
03:55:23,600 --> 03:55:26,377
created the graph then we
are taking this graph.

5553
03:55:26,377 --> 03:55:27,979
We are finding the triplets

5554
03:55:27,979 --> 03:55:30,194
and then we are
mapping each triplet.

5555
03:55:30,194 --> 03:55:33,700
We are trying to find out
triplet.srcAttr,

5556
03:55:33,700 --> 03:55:36,155
in which we are picking
up the username.

5557
03:55:36,155 --> 03:55:37,100
Then over here.

5558
03:55:37,100 --> 03:55:39,800
We are trying to pick up
triplet.attr,

5559
03:55:39,800 --> 03:55:42,400
which is nothing
but the edge attribute

5560
03:55:42,400 --> 03:55:44,400
which is your academic advisor.

5561
03:55:44,400 --> 03:55:45,800
Then we are trying

5562
03:55:45,800 --> 03:55:48,800
to pick up
triplet.dstAttr.

5563
03:55:48,800 --> 03:55:50,904
It will again pick
up the username

5564
03:55:50,904 --> 03:55:52,500
of destination attribute,

5565
03:55:52,500 --> 03:55:54,766
which is username
of this vertex 3.

5566
03:55:54,766 --> 03:55:57,100
So for an example
in this situation,

5567
03:55:57,100 --> 03:56:01,000
it will print Thomas is
the academic advisor of Jenny.

5568
03:56:01,000 --> 03:56:03,211
So then we are trying
to take these facts.

5569
03:56:03,211 --> 03:56:04,726
We are collecting the facts

5570
03:56:04,726 --> 03:56:07,900
and using this foreach we are
printing each of the triplets

5571
03:56:07,900 --> 03:56:09,812
that are present in this graph.

5572
03:56:09,812 --> 03:56:10,385
So I hope

5573
03:56:10,385 --> 03:56:13,700
that you guys are clear
with the concepts of triplet.
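The triplet view described above can be sketched as follows: each triplet exposes srcAttr, attr, and dstAttr, which are combined into one sentence per edge. The edge directions in the sample graph and the local SparkContext setup are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object TripletFacts {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; in spark-shell `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("triplets").setMaster("local[*]"))

    val users = sc.parallelize(Seq(
      (1L, ("Thomas", "professor")), (2L, ("Frank", "professor")),
      (3L, ("Jenny", "student")), (4L, ("Bob", "doctor"))))
    val relationships = sc.parallelize(Seq(
      Edge(1L, 2L, "colleague"), Edge(1L, 3L, "academic advisor"),
      Edge(2L, 3L, "academic advisor"), Edge(4L, 3L, "health advisor")))
    val graph = Graph(users, relationships, ("default user", "missing"))

    // srcAttr and dstAttr are (username, profession) pairs; _1 is the username.
    // attr is the edge property, that is, the relationship string.
    val facts: RDD[String] = graph.triplets.map { t =>
      s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}"
    }

    // One line per edge, e.g. "Thomas is the academic advisor of Jenny".
    facts.collect().foreach(println)
    sc.stop()
  }
}
```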

5574
03:56:14,600 --> 03:56:17,300
So now let's quickly take
a look at graph Builders.

5575
03:56:17,353 --> 03:56:19,200
So as I already told you

5576
03:56:19,200 --> 03:56:22,700
that GraphX provides
several ways of building a graph

5577
03:56:22,700 --> 03:56:25,551
from a collection of vertices
and edges. Either

5578
03:56:25,551 --> 03:56:28,900
it can be stored in an RDD
or it can be stored on disk.

5579
03:56:28,900 --> 03:56:32,600
So in this Graph object, first
we have this apply method.

5580
03:56:32,600 --> 03:56:36,300
So basically this apply
method allows creating a graph

5581
03:56:36,300 --> 03:56:37,773
from RDDs of vertices

5582
03:56:37,773 --> 03:56:42,000
and edges, and duplicate vertices
are picked arbitrarily,

5583
03:56:42,000 --> 03:56:43,139
and the vertices

5584
03:56:43,139 --> 03:56:46,700
which are found in the edge RDD
and are not present

5585
03:56:46,700 --> 03:56:50,522
in the vertex RDD are assigned
a default attribute.

5586
03:56:50,522 --> 03:56:52,653
So in this apply method first,

5587
03:56:52,653 --> 03:56:55,100
we are providing
the vertex RDD, then

5588
03:56:55,100 --> 03:56:57,000
we are providing the edge RDD,

5589
03:56:57,000 --> 03:57:00,311
and then we are providing
the default vertex attribute.

5590
03:57:00,311 --> 03:57:03,613
So it will create the vertex
which we have specified.

5591
03:57:03,613 --> 03:57:05,400
Then it will create the edges

5592
03:57:05,400 --> 03:57:08,700
which are specified and
if there is a vertex

5593
03:57:08,700 --> 03:57:11,173
which is being referred
by The Edge,

5594
03:57:11,173 --> 03:57:14,000
but it is not present
in this vertex RDD,

5595
03:57:14,000 --> 03:57:16,763
then what it does is it
creates that vertex

5596
03:57:16,763 --> 03:57:20,900
and assigns it the value of
this default vertex attribute.
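To make the apply behaviour concrete, here is a minimal sketch in plain Python. It is not the GraphX API, only a model of what was just described: duplicate vertex IDs are resolved arbitrarily (here, first one wins) and edge endpoints missing from the vertex collection get the default attribute. The function name `build_graph`, the `colleague` edge, and vertex 7 are made up for illustration.

```python
def build_graph(vertices, edges, default_attr):
    """Plain-Python model of the behaviour described for Graph.apply."""
    vertex_attrs = {}
    for vid, attr in vertices:
        # Duplicate IDs: keep the first entry seen (an arbitrary choice).
        vertex_attrs.setdefault(vid, attr)
    for src, dst, _ in edges:
        # Endpoints that never appeared in `vertices` get the default attribute.
        vertex_attrs.setdefault(src, default_attr)
        vertex_attrs.setdefault(dst, default_attr)
    return vertex_attrs, list(edges)


vertices = [(1, "Thomas"), (3, "Jenny")]
# Vertex 7 is referenced by an edge but is not in the vertex collection.
edges = [(1, 3, "academic advisor"), (3, 7, "colleague")]
vertex_attrs, graph_edges = build_graph(vertices, edges, "unknown")
```

Calling the same function with an empty vertex list models the fromEdges case, where every vertex ends up with the default value.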

5597
03:57:20,900 --> 03:57:22,700
Next we have fromEdges.

5598
03:57:22,700 --> 03:57:27,000
So Graph.fromEdges
allows creating a graph only

5599
03:57:27,000 --> 03:57:28,900
from the RDD of edges,

5600
03:57:29,000 --> 03:57:32,266
which automatically creates
any vertices mentioned

5601
03:57:32,266 --> 03:57:35,400
in the edges and assigns
them the default value.

5602
03:57:35,500 --> 03:57:39,000
So what happens over here
you provide the edge RDD,

5603
03:57:39,000 --> 03:57:40,496
and all the vertices

5604
03:57:40,496 --> 03:57:44,385
that are present in the edge RDD
are automatically created

5605
03:57:44,385 --> 03:57:48,500
and the default value is assigned
to each of those vertices.

5606
03:57:48,500 --> 03:57:49,522
So Graph.fromEdgeTuples

5607
03:57:49,522 --> 03:57:53,100
basically
allows creating a graph

5608
03:57:53,100 --> 03:57:55,484
from only an RDD of edge tuples,

5609
03:57:55,500 --> 03:58:00,100
and it assigns the edges
the value 1. Again, the vertices

5610
03:58:00,100 --> 03:58:04,200
which are specified by the edges
are automatically created

5611
03:58:04,200 --> 03:58:05,788
and the default value which

5612
03:58:05,788 --> 03:58:09,005
we are specifying over here
will be allocated to them.

5613
03:58:09,005 --> 03:58:10,100
So basically

5614
03:58:10,100 --> 03:58:12,980
fromEdgeTuples supports
deduplication of edges,

5615
03:58:12,980 --> 03:58:15,800
which means you can remove
the duplicate edges,

5616
03:58:15,800 --> 03:58:19,373
but for that you have
to provide a partition strategy

5617
03:58:19,373 --> 03:58:23,953
in the uniqueEdges parameter,
as it is necessary to co-locate

5618
03:58:23,953 --> 03:58:25,277
identical edges

5619
03:58:25,277 --> 03:58:28,900
on the same partition so that
duplicate edges can be removed.

5620
03:58:29,100 --> 03:58:33,000
So moving ahead, none of the graph
builders repartitions

5621
03:58:33,000 --> 03:58:37,146
the graph's edges by default;
instead, edges are left

5622
03:58:37,146 --> 03:58:39,300
in their default partitions.

5623
03:58:39,300 --> 03:58:42,540
So as you can see,
we have a GraphLoader object,

5624
03:58:42,540 --> 03:58:44,700
which is basically used to load

5625
03:58:44,700 --> 03:58:46,776
graphs from the file system.

5626
03:58:46,900 --> 03:58:51,571
So Graph.groupEdges requires
the graph to be repartitioned

5627
03:58:51,571 --> 03:58:52,956
because it assumes

5628
03:58:53,000 --> 03:58:55,900
that identical edges
will be co-located

5629
03:58:55,900 --> 03:58:57,378
on the same partition.

5630
03:58:57,378 --> 03:59:00,200
And so you must call
Graph.partitionBy

5631
03:59:00,200 --> 03:59:02,200
before calling groupEdges.

5632
03:59:02,900 --> 03:59:07,500
So now you can see the
edgeListFile method over here,

5633
03:59:07,538 --> 03:59:12,000
which provides a way to load
a graph from the list of edges

5634
03:59:12,000 --> 03:59:14,577
which is present
on the disk and it

5635
03:59:14,577 --> 03:59:18,900
parses the adjacency list,
that is, your source vertex ID

5636
03:59:18,900 --> 03:59:22,900
and destination vertex ID
pairs, and it creates a graph.

5637
03:59:23,200 --> 03:59:24,300
So now for an example,

5638
03:59:24,300 --> 03:59:29,600
let's say we have 2 1,
which is one edge, then you have

5639
03:59:29,600 --> 03:59:31,533
4 1, which is another edge,

5640
03:59:31,533 --> 03:59:34,600
and then you have 1 2,
which is another edge.

5641
03:59:34,600 --> 03:59:36,700
So it will load these edges

5642
03:59:36,900 --> 03:59:39,300
and then it will
create the graph.

5643
03:59:39,300 --> 03:59:40,792
So it will create 2,

5644
03:59:40,792 --> 03:59:44,600
then it will create
4, and then it will create 1.

5645
03:59:44,900 --> 03:59:46,100
And for 2 1 it

5646
03:59:46,100 --> 03:59:49,757
will create the edge, and then
for 4 1 it will create the edge,

5647
03:59:49,757 --> 03:59:52,500
and at last it creates
an edge for 1 and 2.

5648
03:59:52,700 --> 03:59:55,300
So it creates a graph
something like this.

5649
03:59:56,000 --> 03:59:59,100
It creates a graph
from specified edges

5650
03:59:59,300 --> 04:00:01,929
where automatically
vertices are created

5651
04:00:01,929 --> 04:00:05,751
which are mentioned by the edges
and all the vertex

5652
04:00:05,751 --> 04:00:08,465
and edge attributes
are set to 1 by default,

5653
04:00:08,465 --> 04:00:10,907
so the value 1
will be associated

5654
04:00:10,907 --> 04:00:12,400
with all the vertices.

5655
04:00:12,543 --> 04:00:15,900
So it will be (4, 1),
then again for this

5656
04:00:15,900 --> 04:00:19,200
it would be (1, 1),
and similarly it would be

5657
04:00:19,200 --> 04:00:21,201
(2, 1) for this vertex.

5658
04:00:21,800 --> 04:00:24,184
Now, let's go back to the code.

5659
04:00:24,184 --> 04:00:27,800
So then we have
this canonicalOrientation.

5660
04:00:28,200 --> 04:00:31,655
So this argument
allows reorienting edges

5661
04:00:31,655 --> 04:00:33,500
in the positive direction

5662
04:00:33,500 --> 04:00:35,100
that is, from the lower source ID

5663
04:00:35,100 --> 04:00:38,000
to the higher
destination ID,

5664
04:00:38,000 --> 04:00:40,800
which is basically required
by your connected components

5665
04:00:40,800 --> 04:00:41,782
algorithm. We will talk

5666
04:00:41,782 --> 04:00:43,800
about this algorithm
in a while, guys,

5667
04:00:44,100 --> 04:00:47,069
but before that,
this basically helps

5668
04:00:47,069 --> 04:00:49,300
in reorienting your edges,

5669
04:00:49,300 --> 04:00:51,500
which means your source vertex ID

5670
04:00:51,500 --> 04:00:55,400
should always be less
than your destination vertex ID.

5671
04:00:55,400 --> 04:00:58,700
So in that situation it
might reorient this Edge.

5672
04:00:58,700 --> 04:01:01,970
So it will reorient this edge
and basically reverse the

5673
04:01:01,970 --> 04:01:04,862
direction of the edge.
Similarly over here,

5674
04:01:04,862 --> 04:01:06,000
the edge

5675
04:01:06,000 --> 04:01:08,896
which is coming from 2 to 1
will be reoriented

5676
04:01:08,896 --> 04:01:10,700
and will be again reversed.
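The edge list loading and the canonical orientation flag can be sketched in plain Python. This is only a model of the behaviour described above, not the GraphLoader implementation: each line holds a source ID and a destination ID, every vertex and edge attribute is set to 1, and with canonical orientation switched on an edge is flipped whenever its source ID is larger than its destination ID. The function name `parse_edge_list` is made up, and skipping lines that start with `#` is my understanding of how comment lines are treated.

```python
def parse_edge_list(text, canonical_orientation=False):
    """Plain-Python model of loading a graph from an edge list."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        src, dst = (int(tok) for tok in line.split()[:2])
        if canonical_orientation and src > dst:
            # Reorient so that the source ID is the smaller one.
            src, dst = dst, src
        edges.append((src, dst, 1))  # every edge attribute is 1
    # Every vertex mentioned by an edge is created with attribute 1.
    vertices = {vid: 1 for src, dst, _ in edges for vid in (src, dst)}
    return vertices, edges


# The three edges from the example: 2 1, 4 1 and 1 2.
vertices, edges = parse_edge_list("2 1\n4 1\n1 2\n", canonical_orientation=True)
```

With the flag on, the edges 2 1 and 4 1 are reversed to 1 2 and 1 4, while 1 2 is already in the positive direction and stays as it is.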

5677
04:01:10,700 --> 04:01:11,754
Now, talking

5678
04:01:11,754 --> 04:01:16,300
about the minimum edge partitions:
this minimum edge partitions parameter

5679
04:01:16,300 --> 04:01:18,858
basically specifies
the minimum number

5680
04:01:18,858 --> 04:01:21,900
of edge partitions
to generate. There might be

5681
04:01:21,900 --> 04:01:24,242
more edge partitions
than specified.

5682
04:01:24,242 --> 04:01:26,900
So let's say the HDFS
file has more blocks.

5683
04:01:26,900 --> 04:01:29,300
So obviously more partitions
will be created

5684
04:01:29,300 --> 04:01:32,182
but this will give you
the minimum Edge partitions

5685
04:01:32,182 --> 04:01:33,651
that should be created.

5686
04:01:33,651 --> 04:01:34,192
So I hope

5687
04:01:34,192 --> 04:01:36,900
that you guys are clear
with this graph loader

5688
04:01:36,900 --> 04:01:38,358
how this graph loader Works

5689
04:01:38,358 --> 04:01:41,300
how you can go ahead
and provide the edge list file

5690
04:01:41,300 --> 04:01:43,300
and how it will create the graph

5691
04:01:43,300 --> 04:01:47,124
from this Edge list file and
then this canonical orientation

5692
04:01:47,124 --> 04:01:50,300
where we are again going
and reorienting the graph

5693
04:01:50,300 --> 04:01:52,299
and then we have
Minimum Edge partition

5694
04:01:52,299 --> 04:01:54,900
which is giving the minimum
number of edge partitions

5695
04:01:54,900 --> 04:01:56,300
that should be created.

5696
04:01:56,300 --> 04:02:00,000
So now I guess you guys are
clear with the graph Builder.

5697
04:02:00,000 --> 04:02:03,400
So how to go ahead and use
this graph object

5698
04:02:03,400 --> 04:02:06,900
and how to create a graph
using the apply, fromEdges,

5699
04:02:06,900 --> 04:02:09,200
and fromEdgeTuples methods,

5700
04:02:09,400 --> 04:02:11,700
and then I guess
you might be clear

5701
04:02:11,700 --> 04:02:13,586
with the GraphLoader object

5702
04:02:13,586 --> 04:02:17,715
and where you can go ahead and
create a graph from Edge list.

5703
04:02:17,715 --> 04:02:17,990
Now.

5704
04:02:17,990 --> 04:02:21,500
Let's move ahead and talk
about VertexRDD and EdgeRDD.

5705
04:02:21,900 --> 04:02:23,561
So as I already told you

5706
04:02:23,561 --> 04:02:27,007
that GraphX exposes
RDD views of the vertices

5707
04:02:27,007 --> 04:02:30,056
and edges stored
within the graph. However,

5708
04:02:30,056 --> 04:02:33,798
GraphX again
maintains the vertices and edges

5709
04:02:33,798 --> 04:02:35,600
in optimized data structures,

5710
04:02:35,600 --> 04:02:36,979
and these data structures

5711
04:02:36,979 --> 04:02:39,499
provide additional
functionalities as well.

5712
04:02:39,499 --> 04:02:42,679
Now, let us see some of
the additional functionalities

5713
04:02:42,679 --> 04:02:44,300
which are provided by them.

5714
04:02:44,465 --> 04:02:47,234
So let's first talk
about VertexRDD.

5715
04:02:47,600 --> 04:02:51,100
So I already told
you that VertexRDD

5716
04:02:51,100 --> 04:02:54,800
is basically extending
this RDD of vertex ID

5717
04:02:54,800 --> 04:02:59,338
and the vertex property and it
adds an additional constraint

5718
04:02:59,338 --> 04:03:05,600
that each vertex ID occurs only
once. Moreover, VertexRDD[A]

5719
04:03:05,800 --> 04:03:10,000
represents a set of vertices,
each with an attribute

5720
04:03:10,000 --> 04:03:12,600
of type A. Now, internally,

5721
04:03:12,700 --> 04:03:17,600
what happens is this is achieved
by storing the vertex attributes

5722
04:03:17,700 --> 04:03:19,184
in a reusable

5723
04:03:19,184 --> 04:03:21,030
hash map data structure.

5724
04:03:24,200 --> 04:03:27,700
So suppose, this is
our hash map data structure.

5725
04:03:27,700 --> 04:03:30,200
So suppose two VertexRDDs

5726
04:03:30,200 --> 04:03:34,840
are derived from the same
base VertexRDD. Suppose

5727
04:03:35,280 --> 04:03:37,600
these are two VertexRDDs

5728
04:03:37,600 --> 04:03:41,200
which are basically derived
from this VertexRDD,

5729
04:03:41,200 --> 04:03:44,400
so they can be joined
in constant time

5730
04:03:44,400 --> 04:03:46,100
without hash evaluations.

5731
04:03:46,100 --> 04:03:49,400
So you don't have to go ahead
and evaluate the properties

5732
04:03:49,400 --> 04:03:52,400
of both the vertices
you can easily go ahead

5733
04:03:52,400 --> 04:03:55,398
and you can join them
without the hashing,

5734
04:03:55,400 --> 04:03:58,288
and this is one of the ways
in which VertexRDD

5735
04:03:58,288 --> 04:04:00,800
provides you
the optimization. Now,

5736
04:04:00,800 --> 04:04:03,900
to leverage this
indexed data structure

5737
04:04:04,200 --> 04:04:08,700
the VertexRDD exposes multiple
additional functionalities.

5738
04:04:09,000 --> 04:04:11,000
So it gives you
all these functions

5739
04:04:11,000 --> 04:04:12,000
as you can see here.

5740
04:04:12,300 --> 04:04:15,300
It gives you filter,
mapValues, then minus,

5741
04:04:15,300 --> 04:04:16,663
diff, leftJoin,

5742
04:04:16,663 --> 04:04:19,800
innerJoin, and
aggregateUsingIndex functions.

5743
04:04:19,800 --> 04:04:22,600
So let us first discuss
about these functions.

5744
04:04:22,600 --> 04:04:26,800
So basically the filter function
filters the vertex set

5745
04:04:26,800 --> 04:04:31,700
but preserves the internal index.
So based on some condition,

5746
04:04:31,700 --> 04:04:33,405
it filters the vertices

5747
04:04:33,405 --> 04:04:36,300
that are present.
Then there is mapValues.

5748
04:04:36,300 --> 04:04:39,200
It is basically used
to transform the values

5749
04:04:39,200 --> 04:04:41,000
without changing the IDs

5750
04:04:41,000 --> 04:04:44,461
and which again preserves
your internal index.

5751
04:04:44,461 --> 04:04:49,399
So it does not change the IDs
of the vertices and it helps

5752
04:04:49,399 --> 04:04:53,100
in transforming those values.
Now talking about the minus

5753
04:04:53,100 --> 04:04:55,900
method, it shows what is unique

5754
04:04:55,900 --> 04:04:58,500
in the set based
on their vertex IDs.

5755
04:04:58,500 --> 04:04:59,500
So what happens

5756
04:04:59,500 --> 04:05:03,300
if you are providing two sets
of vertices, the first contains V1, V2,

5757
04:05:03,300 --> 04:05:06,100
and V3, and the second
one contains V3,

5758
04:05:06,200 --> 04:05:08,276
so it will return V1 and V2

5759
04:05:08,276 --> 04:05:11,366
because they appear only
in the first set,

5760
04:05:11,700 --> 04:05:14,700
and it is basically done
with the help of vertex ID.

5761
04:05:14,900 --> 04:05:17,053
So next we have the diff function.

5762
04:05:17,100 --> 04:05:20,900
So for vertices present in both sets,
it returns only those

5763
04:05:20,900 --> 04:05:25,800
whose values differ. Then
we have leftJoin and innerJoin.

5764
04:05:25,800 --> 04:05:28,300
So join operators
basically take advantage

5765
04:05:28,300 --> 04:05:30,900
of the internal indexing
to accelerate joins.

5766
04:05:30,900 --> 04:05:32,900
So you can go ahead
and you can perform left join

5767
04:05:32,900 --> 04:05:34,400
or you can perform inner join.

5768
04:05:34,453 --> 04:05:37,246
Next you have
aggregateUsingIndex.

5769
04:05:37,700 --> 04:05:40,800
So basically aggregateUsingIndex
is nothing

5770
04:05:40,800 --> 04:05:42,400
but reduceByKey,

5771
04:05:42,500 --> 04:05:44,200
but it uses the index

5772
04:05:44,300 --> 04:05:48,000
on this RDD to accelerate
the reduceByKey function,

5773
04:05:48,000 --> 04:05:50,500
or you can say
reduceByKey operation.

5774
04:05:50,700 --> 04:05:54,900
So again, filter is actually
implemented using a BitSet, thereby

5775
04:05:54,900 --> 04:05:56,500
reusing the index

5776
04:05:56,500 --> 04:05:58,800
and preserving the ability to do

5777
04:05:58,800 --> 04:06:02,220
fast joins with other
VertexRDDs. Now similarly,

5778
04:06:02,220 --> 04:06:04,600
the mapValues operators

5779
04:06:04,600 --> 04:06:08,200
do not allow the map function
to change the vertex ID,

5780
04:06:08,200 --> 04:06:09,600
and this again helps

5781
04:06:09,600 --> 04:06:13,120
in reusing the same
hash map data structure. Now both

5782
04:06:13,120 --> 04:06:14,533
your left join as

5783
04:06:14,533 --> 04:06:17,900
well as your inner join
are able to identify

5784
04:06:17,900 --> 04:06:20,400
whether the two VertexRDDs

5785
04:06:20,400 --> 04:06:23,169
which are being joined
are derived from the same

5786
04:06:23,169 --> 04:06:24,208
hash map or not.

5787
04:06:24,208 --> 04:06:28,300
And if they are, they basically use
a linear scan, so they don't have

5788
04:06:28,300 --> 04:06:31,900
to go ahead and perform
costly point lookups.

5789
04:06:31,900 --> 04:06:35,300
So this is the benefit
of using VertexRDD.

5790
04:06:35,500 --> 04:06:36,571
So to summarize

5791
04:06:36,571 --> 04:06:40,300
your VertexRDD uses
a hash map data structure,

5792
04:06:40,426 --> 04:06:42,273
which is again reusable.

5793
04:06:42,300 --> 04:06:44,700
They try to
preserve your indexes

5794
04:06:44,700 --> 04:06:48,500
so that it would be easier
to create a new VertexRDD

5795
04:06:48,500 --> 04:06:51,404
or derive a new VertexRDD
from them. Then again,

5796
04:06:51,404 --> 04:06:54,000
while performing some
join operations,

5797
04:06:54,000 --> 04:06:57,900
it is pretty easy to go
ahead, perform a linear scan,

5798
04:06:57,900 --> 04:07:01,500
and then you can go ahead
and join those two VertexRDDs.

5799
04:07:01,500 --> 04:07:05,423
So it actually helps
in optimizing your performance.
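Here is a small plain-Python sketch of three of these VertexRDD operations, with each vertex set modelled as a dictionary from vertex ID to attribute. It is not the GraphX API and it ignores the index reuse that makes the real operators fast. It only shows what each one returns. The function names are made up, and dropping messages whose ID is not in the index is my understanding of aggregateUsingIndex rather than something stated in this session.

```python
from operator import add


def minus(this, other):
    """Vertices of `this` whose IDs do not appear in `other`."""
    return {vid: attr for vid, attr in this.items() if vid not in other}


def diff(this, other):
    """For IDs in both sets, keep those whose values differ (value from `other`)."""
    return {vid: other[vid] for vid in this
            if vid in other and this[vid] != other[vid]}


def aggregate_using_index(index, messages, reduce_func):
    """reduceByKey over (vertex ID, value) messages, restricted to IDs in `index`."""
    result = {}
    for vid, value in messages:
        if vid not in index:
            continue  # assumption: messages for unknown vertex IDs are dropped
        result[vid] = reduce_func(result[vid], value) if vid in result else value
    return result


first = {1: "a", 2: "b", 3: "c"}   # V1, V2, V3
second = {3: "z"}                  # V3 with a different value
only_in_first = minus(first, second)
changed = diff(first, second)
totals = aggregate_using_index(first, [(1, 10), (1, 5), (3, 2), (9, 7)], add)
```

With these inputs, minus keeps V1 and V2, diff keeps only V3 because its value changed, and the aggregation sums the two messages sent to vertex 1.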

5800
04:07:05,700 --> 04:07:06,700
Now moving ahead.

5801
04:07:06,700 --> 04:07:10,200
Let's talk about
EdgeRDD. Now again,

5802
04:07:10,200 --> 04:07:13,900
as you can see, your EdgeRDD[ED]
is extending

5803
04:07:13,900 --> 04:07:15,400
RDD[Edge[ED]].

5804
04:07:15,400 --> 04:07:18,792
Now it organizes the edges
in blocks partitioned using

5805
04:07:18,792 --> 04:07:21,700
one of the various
partitioning strategies,

5806
04:07:21,700 --> 04:07:25,608
which is again defined in your
partition strategy attribute,

5807
04:07:25,608 --> 04:07:28,800
or you can say partition
strategy parameter. Within

5808
04:07:28,800 --> 04:07:30,865
each partition, edge attributes

5809
04:07:30,865 --> 04:07:34,100
and adjacency structure
are stored separately,

5810
04:07:34,100 --> 04:07:36,200
which enables the maximum reuse

5811
04:07:36,200 --> 04:07:38,200
when changing the
attribute values.

5812
04:07:38,600 --> 04:07:42,900
So basically, while
storing your edge attributes

5813
04:07:42,900 --> 04:07:46,400
and your source vertex
and destination vertex,

5814
04:07:46,400 --> 04:07:48,400
they are stored separately, so

5815
04:07:48,400 --> 04:07:51,200
that when changing the values
of the attributes,

5816
04:07:51,200 --> 04:07:54,200
either of the source
vertex or destination vertex

5817
04:07:54,200 --> 04:07:55,500
or the edge attribute,

5818
04:07:55,500 --> 04:07:58,300
the structure can be
reused as many times

5819
04:07:58,300 --> 04:08:01,600
as we need by changing
only the attribute values.

5820
04:08:01,600 --> 04:08:04,713
So that once the vertex ID
is changed of an edge.

5821
04:08:04,713 --> 04:08:06,400
It could be easily changed

5822
04:08:06,400 --> 04:08:09,196
and the earlier part
can be reused. Now,

5823
04:08:09,196 --> 04:08:10,314
as you can see,

5824
04:08:10,314 --> 04:08:13,518
we have three additional
functions over here

5825
04:08:13,518 --> 04:08:16,500
that is, mapValues,
reverse, and innerJoin.

5826
04:08:16,700 --> 04:08:19,000
So in EdgeRDD, basically

5827
04:08:19,000 --> 04:08:21,400
mapValues is to transform
the edge attributes

5828
04:08:21,400 --> 04:08:23,200
while preserving the structure.

5829
04:08:23,200 --> 04:08:25,029
So it is helpful in transforming,

5830
04:08:25,029 --> 04:08:28,500
so you can use mapValues and
map the values of your EdgeRDD.

5831
04:08:28,800 --> 04:08:31,300
Then you can go ahead and use
this reverse function

5832
04:08:31,300 --> 04:08:35,400
which reverses the edges, reusing
both attributes and structure.

5833
04:08:35,400 --> 04:08:37,531
So the source
becomes destination.

5834
04:08:37,531 --> 04:08:40,179
The destination becomes
source. Now talking

5835
04:08:40,179 --> 04:08:41,600
about this innerJoin.

5836
04:08:41,700 --> 04:08:43,600
So it basically joins

5837
04:08:43,600 --> 04:08:48,500
two EdgeRDDs partitioned using
the same partitioning strategy.

5838
04:08:49,100 --> 04:08:52,900
Now as we already discussed,
the same partition strategy is

5839
04:08:52,900 --> 04:08:55,585
required because again,
to co-locate, you need

5840
04:08:55,585 --> 04:08:57,600
to use the same partition strategy,

5841
04:08:57,600 --> 04:08:59,682
and your identical
vertex should reside

5842
04:08:59,682 --> 04:09:02,800
in same partition to perform
join operation over them.

5843
04:09:02,800 --> 04:09:03,092
Now.

5844
04:09:03,092 --> 04:09:07,290
Let me quickly give you an idea
about optimization performed

5845
04:09:07,290 --> 04:09:08,500
in GraphX.

5846
04:09:08,536 --> 04:09:10,151
So GraphX basically

5847
04:09:10,151 --> 04:09:14,844
adopts a vertex-cut approach to
distributed graph partitioning.

5848
04:09:15,500 --> 04:09:20,700
So suppose you have five vertices
and then they are connected.

5849
04:09:20,800 --> 04:09:23,100
Let's not worry
about the arrows right

5850
04:09:23,100 --> 04:09:26,200
now, or let's not worry
about direction right now.

5851
04:09:26,200 --> 04:09:29,200
So either it can be divided
along the edges,

5852
04:09:29,200 --> 04:09:32,287
which is one approach, or again

5853
04:09:32,287 --> 04:09:34,825
it can be divided
along the vertices.

5854
04:09:35,300 --> 04:09:36,840
So in that situation,

5855
04:09:36,840 --> 04:09:39,700
it would be divided
something like this.

5856
04:09:41,200 --> 04:09:43,500
So rather than splitting graphs

5857
04:09:43,500 --> 04:09:47,900
along edges, GraphX partitions
the graph along vertices,

5858
04:09:47,900 --> 04:09:50,305
which can again
reduce the communication

5859
04:09:50,305 --> 04:09:51,600
and storage overhead.

5860
04:09:51,600 --> 04:09:53,523
So logically what happens

5861
04:09:53,523 --> 04:09:56,500
that your edges
are assigned to machines

5862
04:09:56,500 --> 04:10:00,200
while allowing your vertices
to span multiple machines.

5863
04:10:00,200 --> 04:10:03,500
So the vertices are basically
spread across multiple machines,

5864
04:10:03,500 --> 04:10:06,900
and each edge is assigned
to a single machine, right?

5865
04:10:06,900 --> 04:10:09,600
Then the exact method
of assigning edges

5866
04:10:09,600 --> 04:10:11,800
depends on the
partition strategy.

5867
04:10:11,800 --> 04:10:15,400
So the partition strategy is
the one which basically decides

5868
04:10:15,400 --> 04:10:16,800
how to assign the edges

5869
04:10:16,800 --> 04:10:20,300
to different machines, or you
can say different partitions.

5870
04:10:20,300 --> 04:10:21,400
So user can choose

5871
04:10:21,400 --> 04:10:24,900
between different strategies
by repartitioning the graph

5872
04:10:24,900 --> 04:10:28,200
with the help of this
Graph.partitionBy operator.

5873
04:10:28,200 --> 04:10:29,500
Now as we discussed

5874
04:10:29,500 --> 04:10:31,329
that this Graph.partitionBy

5875
04:10:31,329 --> 04:10:34,400
operator repartitions,
and then it divides

5876
04:10:34,400 --> 04:10:36,900
or relocates the edges

5877
04:10:37,000 --> 04:10:39,900
and basically we try
to put the identical edges

5878
04:10:39,900 --> 04:10:41,500
on a single partition

5879
04:10:41,500 --> 04:10:43,827
so that different
operations like join

5880
04:10:43,827 --> 04:10:45,400
can be performed on them.

5881
04:10:45,400 --> 04:10:49,629
So once the edges have been
partitioned, the main challenge

5882
04:10:49,629 --> 04:10:52,690
is efficiently joining
the vertex attributes

5883
04:10:52,690 --> 04:10:54,400
with the edges, right? Now,

5884
04:10:54,400 --> 04:10:56,000
because real world graphs

5885
04:10:56,000 --> 04:10:58,600
typically have more
edges than vertices,

5886
04:10:58,600 --> 04:11:03,300
we move vertex attributes
to the edges. And because not all

5887
04:11:03,300 --> 04:11:07,800
the partitions will contain
edges adjacent to all vertices,

5888
04:11:07,800 --> 04:11:09,755
we internally maintain a

5889
04:11:09,755 --> 04:11:10,700
routing table.

5890
04:11:10,700 --> 04:11:14,400
So the routing table is the one
that identifies where to broadcast the vertices

5891
04:11:14,400 --> 04:11:18,146
when implementing the join
required for the operations.

5892
04:11:18,146 --> 04:11:18,946
So, I hope

5893
04:11:18,946 --> 04:11:22,200
that you guys are clear
how VertexRDD and EdgeRDD

5894
04:11:22,200 --> 04:11:23,338
work, and then

5895
04:11:23,338 --> 04:11:25,800
how the optimizations take place

5896
04:11:25,800 --> 04:11:29,900
and how vertex cut optimizes
the operations in GraphX.
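The vertex-cut idea and the routing table can be sketched in a few lines of plain Python. This is a toy model and not how GraphX is implemented: each edge is placed on exactly one partition, a vertex may therefore show up on several partitions, and the routing table records, for each vertex, the partitions that hold edges touching it. The strategy used here (source ID modulo the number of partitions) and the function names are assumptions for illustration only.

```python
from collections import defaultdict


def partition_edges(edges, num_partitions):
    """Assign every edge to exactly one partition (hypothetical strategy)."""
    partitions = defaultdict(list)
    for src, dst in edges:
        # Illustrative rule only: pick the partition from the source ID.
        partitions[src % num_partitions].append((src, dst))
    return dict(partitions)


def build_routing_table(partitions):
    """Map each vertex ID to the set of partitions whose edges touch it."""
    routing = defaultdict(set)
    for pid, part_edges in partitions.items():
        for src, dst in part_edges:
            routing[src].add(pid)
            routing[dst].add(pid)
    return dict(routing)


edges = [(1, 2), (2, 3), (3, 1), (4, 1)]
partitions = partition_edges(edges, num_partitions=2)
routing = build_routing_table(partitions)
```

In this toy run vertex 1 has edges on both partitions, so its attribute would have to be shipped to both, while vertex 4 is only needed on one partition.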

5897
04:11:30,100 --> 04:11:32,600
Now, let's talk
about graph operators.

5898
04:11:32,600 --> 04:11:35,480
So just as RDDs
have basic operations

5899
04:11:35,480 --> 04:11:37,400
like map, filter, and reduceByKey,

5900
04:11:37,400 --> 04:11:41,300
property graphs also have
a collection of basic operators

5901
04:11:41,300 --> 04:11:44,530
that take user-defined functions
and produce new graphs

5902
04:11:44,530 --> 04:11:48,029
with transformed properties and
structure. Now the core operators

5903
04:11:48,029 --> 04:11:50,900
that have optimized
implementation are basically

5904
04:11:50,900 --> 04:11:54,061
defined in the Graph class,
and convenience operators

5905
04:11:54,061 --> 04:11:55,262
that are expressed

5906
04:11:55,262 --> 04:11:57,600
as a composition
of the core operators

5907
04:11:57,600 --> 04:12:00,500
are basically defined
in your GraphOps class.

5908
04:12:00,500 --> 04:12:03,346
But thanks to Scala
implicits, the operators

5909
04:12:03,346 --> 04:12:04,800
in the GraphOps class

5910
04:12:04,800 --> 04:12:08,500
are automatically available
as members of the Graph class,

5911
04:12:08,600 --> 04:12:09,600
so you can use them

5912
04:12:09,700 --> 04:12:12,450
using the Graph
class as well. Now,

5913
04:12:12,500 --> 04:12:14,593
as you can see, we have
a list of operators

5914
04:12:14,593 --> 04:12:15,858
like property operator,

5915
04:12:15,858 --> 04:12:17,800
then you have
structural operator.

5916
04:12:17,800 --> 04:12:19,300
Then you have join operator

5917
04:12:19,300 --> 04:12:22,000
and then you have something
called neighborhood operator.

5918
04:12:22,000 --> 04:12:24,700
So let's talk about them one
by one now talking

5919
04:12:24,700 --> 04:12:26,400
about property operators,

5920
04:12:26,400 --> 04:12:30,016
just as RDD has the map operator,
the property graph contains

5921
04:12:30,016 --> 04:12:34,168
the mapVertices, mapEdges, and
mapTriplets operators, right? Now,

5922
04:12:34,168 --> 04:12:38,445
each of these operators basically
yields a new graph with the vertex

5923
04:12:38,445 --> 04:12:39,600
or edge properties

5924
04:12:39,600 --> 04:12:42,600
modified by the user-defined
map function. Based

5925
04:12:42,600 --> 04:12:46,366
on the user-defined map function
it basically transforms

5926
04:12:46,366 --> 04:12:47,915
or modifies the vertices

5927
04:12:47,915 --> 04:12:49,202
if it's mapVertices,

5928
04:12:49,202 --> 04:12:51,489
or it transforms
or modifies the edges

5929
04:12:51,489 --> 04:12:53,170
if it is the mapEdges method

5930
04:12:53,170 --> 04:12:56,600
or mapEdges operator, and so
on for mapTriplets as well.

5931
04:12:56,600 --> 04:13:00,053
Now the important thing
to note is that in each case.

5932
04:13:00,053 --> 04:13:02,700
The graph structure
is unaffected and this

5933
04:13:02,700 --> 04:13:04,968
is a key feature
of these operators.

5934
04:13:04,968 --> 04:13:07,513
Basically which allows
the resulting graph

5935
04:13:07,513 --> 04:13:09,500
to reuse the structural indices

5936
04:13:09,500 --> 04:13:10,300
of the original graph.

5937
04:13:10,300 --> 04:13:12,600
Each and every time you
apply a transformation,

5938
04:13:12,600 --> 04:13:14,700
it creates a new graph

5939
04:13:14,700 --> 04:13:17,500
and the original
graph is unaffected

5940
04:13:17,500 --> 04:13:19,200
so that it can be used

5941
04:13:19,200 --> 04:13:22,500
so you can say it can be reused
in creating new graphs.

5942
04:13:22,500 --> 04:13:22,800
Right?

5943
04:13:22,800 --> 04:13:24,600
So your structure indices

5944
04:13:24,600 --> 04:13:27,700
can be reused from the original
graph. Now talking

5945
04:13:27,700 --> 04:13:29,400
about this mapVertices.

5946
04:13:29,400 --> 04:13:31,152
Let me use the highlighter.

5947
04:13:31,152 --> 04:13:32,900
So first we have mapVertices.

5948
04:13:32,900 --> 04:13:34,200
So it maps the vertices,

5949
04:13:34,200 --> 04:13:36,100
or you can say
transforms the vertices.

5950
04:13:36,100 --> 04:13:39,300
So you provide the vertex ID
and then the vertex attribute,

5951
04:13:40,100 --> 04:13:43,400
and you apply some
transformation function, using

5952
04:13:43,400 --> 04:13:46,600
which it will give you
a graph with the new vertex property,

5953
04:13:46,600 --> 04:13:49,500
as you can see. Now the same is
the case with mapEdges.

5954
04:13:49,500 --> 04:13:53,800
So again you provide the edges
then you transform the edges.

5955
04:13:53,800 --> 04:13:57,600
So initially it was ED, and then
you transform it to ED2,

5956
04:13:57,700 --> 04:13:58,600
and then the graph

5957
04:13:58,600 --> 04:14:01,000
which is given, or you
can say the graph

5958
04:14:01,000 --> 04:14:04,947
which is returned, is the graph
with the changed edge attribute.

5959
04:14:04,947 --> 04:14:07,535
So you can see here
the attribute is ED2.

5960
04:14:07,535 --> 04:14:09,800
The same is the case
with mapTriplets.

5961
04:14:09,900 --> 04:14:11,500
So using mapTriplets,

5962
04:14:11,500 --> 04:14:14,657
you can use the edge triplet
where you can go ahead

5963
04:14:14,657 --> 04:14:18,700
and Target the vertex Properties
or you can say vertex attributes

5964
04:14:18,700 --> 04:14:21,817
or to be more specific
Source vertex attribute as well

5965
04:14:21,817 --> 04:14:23,641
as destination vertex attribute

5966
04:14:23,641 --> 04:14:26,900
and the edge attribute and then
you can apply transformation

5967
04:14:26,900 --> 04:14:28,654
over those Source attributes

5968
04:14:28,654 --> 04:14:31,600
or destination attributes
or the edge attributes

5969
04:14:31,600 --> 04:14:34,500
so you can change them and then
it will again return a graph

5970
04:14:34,500 --> 04:14:36,300
with the transformed values now,

5971
04:14:36,300 --> 04:14:39,000
I guess you guys are clear
with the property operators.
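The three property operators can be modelled in plain Python to show the key point, which is that only attributes change while the graph structure stays the same. This is not the GraphX API: vertices are a dictionary from ID to attribute, edges are (source, destination, attribute) tuples, and the function names and callback signatures are simplified assumptions.

```python
def map_vertices(vertices, edges, f):
    """New vertex attributes from f(vertex_id, attr); edges are reused as is."""
    return {vid: f(vid, attr) for vid, attr in vertices.items()}, edges


def map_edges(vertices, edges, f):
    """New edge attributes from f(src, dst, attr); endpoints never change."""
    return vertices, [(s, d, f(s, d, a)) for s, d, a in edges]


def map_triplets(vertices, edges, f):
    """Like map_edges, but f also sees the source and destination attributes."""
    return vertices, [(s, d, f(vertices[s], a, vertices[d])) for s, d, a in edges]


vertices = {1: "Thomas", 3: "Jenny"}
edges = [(1, 3, "academic advisor")]

upper_vertices, same_edges = map_vertices(vertices, edges, lambda vid, name: name.upper())
_, facts = map_triplets(vertices, edges, lambda src, rel, dst: f"{src} is the {rel} of {dst}")
```

The triplet version has everything it needs to build the sentence printed earlier, because the callback receives both vertex attributes together with the edge attribute.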

5972
04:14:39,000 --> 04:14:40,819
So let's move to the next operator,

5973
04:14:40,819 --> 04:14:44,958
that is, the structural operators. So
currently GraphX supports only

5974
04:14:44,958 --> 04:14:48,200
a simple set of commonly
used structural operators.

5975
04:14:48,200 --> 04:14:50,712
And we expect more
to be added in future.

5976
04:14:50,712 --> 04:14:53,220
Now you can see
in structural operator.

5977
04:14:53,220 --> 04:14:54,800
We have the reverse operator.

5978
04:14:54,800 --> 04:14:56,464
Then we have subgraph operator.

5979
04:14:56,464 --> 04:14:57,923
Then we have the mask operator,

5980
04:14:57,923 --> 04:15:00,100
and then we have
the groupEdges operator.

5981
04:15:00,100 --> 04:15:04,096
So let's talk about them one by
one. So first, the reverse operator:

5982
04:15:04,096 --> 04:15:05,640
so as the name suggests,

5983
04:15:05,640 --> 04:15:09,500
it returns a new graph with all
the edge directions reversed.

5984
04:15:09,500 --> 04:15:11,750
So basically it will change
your source vertex

5985
04:15:11,750 --> 04:15:12,950
into destination vertex,

5986
04:15:12,950 --> 04:15:15,108
and then it will change
your destination vertex

5987
04:15:15,108 --> 04:15:16,000
into the source vertex.

5988
04:15:16,000 --> 04:15:18,500
So it will reverse
the direction of your edges.

5989
04:15:18,500 --> 04:15:21,600
And the reverse operation
does not modify vertex

5990
04:15:21,600 --> 04:15:23,300
or edge properties or change

5991
04:15:23,300 --> 04:15:24,300
the number of edges.

5992
04:15:24,400 --> 04:15:25,739
It can be implemented

5993
04:15:25,739 --> 04:15:28,800
efficiently without
data movement or duplication.

5994
04:15:28,800 --> 04:15:31,400
So next we have
the subgraph operator.

5995
04:15:31,400 --> 04:15:34,615
So basically the subgraph
operator takes the vertex

5996
04:15:34,615 --> 04:15:35,967
and edge predicates,

5997
04:15:35,967 --> 04:15:38,577
or you can say vertex
or edge conditions,

5998
04:15:38,577 --> 04:15:41,600
and returns the graph
containing only the vertices

5999
04:15:41,600 --> 04:15:44,835
that satisfy the vertex
predicate, and then it returns

6000
04:15:44,835 --> 04:15:47,306
the edges that satisfy
the edge predicates.

6001
04:15:47,306 --> 04:15:50,200
So basically we'll give
a condition about edges and

6002
04:15:50,200 --> 04:15:51,954
vertices and those predicates

6003
04:15:51,954 --> 04:15:54,009
which are fulfilled,
or those vertices

6004
04:15:54,009 --> 04:15:57,303
which are fulfilling the
predicates, will only be returned,

6005
04:15:57,303 --> 04:15:59,302
and again, same is the case
with your edges,

6006
04:15:59,302 --> 04:16:01,237
and then your graph
will be connected.

6007
04:16:01,237 --> 04:16:03,800
Now, the subgraph operator
can be used in a number

6008
04:16:03,800 --> 04:16:06,953
of situations to restrict
the graph to the vertices

6009
04:16:06,953 --> 04:16:08,245
and edges of interest

6010
04:16:08,245 --> 04:16:10,615
and eliminate the rest
of the components,

6011
04:16:10,615 --> 04:16:13,450
right? So you can see
this is the edge predicate.

6012
04:16:13,450 --> 04:16:15,200
This is the vertex predicate.

6013
04:16:15,200 --> 04:16:18,900
Then we are providing
the edge triplet with the vertex

6014
04:16:18,900 --> 04:16:20,500
and edge attributes,

6015
04:16:20,500 --> 04:16:21,567
and we are expecting

6016
04:16:21,567 --> 04:16:24,700
a Boolean value. Then
same is the case with the vertex.

6017
04:16:24,700 --> 04:16:27,100
We're providing the vertex
properties over here

6018
04:16:27,100 --> 04:16:29,150
or you can say vertex
attribute over here.

6019
04:16:29,150 --> 04:16:29,925
And then again,

6020
04:16:29,925 --> 04:16:32,126
it will yield a graph
which is a sub graph

6021
04:16:32,126 --> 04:16:35,400
of the original graph, which will
fulfill those predicates. Now,

6022
04:16:35,400 --> 04:16:37,600
the next operator
is the mask operator.

6023
04:16:37,600 --> 04:16:39,746
So the mask operator constructs a

6024
04:16:39,746 --> 04:16:43,466
subgraph by returning a graph
that contains the vertices

6025
04:16:43,466 --> 04:16:46,888
and edges that are also found
in the input graph.

6026
04:16:46,888 --> 04:16:48,637
Basically, you can treat

6027
04:16:48,637 --> 04:16:52,500
this mask operator as
a comparison between two graphs.

6028
04:16:52,500 --> 04:16:53,314
So suppose

6029
04:16:53,314 --> 04:16:54,500
we are comparing

6030
04:16:54,500 --> 04:16:58,100
graph 1 and graph 2, and it
will return the subgraph

6031
04:16:58,100 --> 04:17:00,800
which is common to both
the graphs. Again,

6032
04:17:00,800 --> 04:17:04,600
this can be used in conjunction
with the subgraph operator,

6033
04:17:04,600 --> 04:17:05,900
basically to restrict

6034
04:17:05,900 --> 04:17:09,400
a graph based on properties
in another related graph, right.

6035
04:17:09,400 --> 04:17:12,280
And so I guess you guys are
clear with the mask operator.

6036
04:17:12,280 --> 04:17:13,000
So over here,

6037
04:17:13,000 --> 04:17:14,233
we're providing a graph

6038
04:17:14,233 --> 04:17:16,776
and then we are providing
the input graph as well.

6039
04:17:16,776 --> 04:17:18,671
And then it will return a graph

6040
04:17:18,671 --> 04:17:21,700
which is basically a subset
of both of these graphs.

6041
04:17:21,700 --> 04:17:23,600
Now, talking about groupEdges.

6042
04:17:23,600 --> 04:17:26,796
So the groupEdges operator
merges the parallel edges

6043
04:17:26,796 --> 04:17:28,446
in the multigraph, right?

6044
04:17:28,446 --> 04:17:29,683
So what it does is,

6045
04:17:29,683 --> 04:17:33,244
the duplicate edges between a pair
of vertices are merged,

6046
04:17:33,244 --> 04:17:35,800
or you can say they
can be aggregated

6047
04:17:35,800 --> 04:17:37,325
or have some action performed,

6048
04:17:37,325 --> 04:17:41,000
and in many numerical
applications edges can be added

6049
04:17:41,000 --> 04:17:43,702
and their weights can be
combined into a single edge,

6050
04:17:43,702 --> 04:17:46,804
right which will again
reduce the size of the graph.

6051
04:17:46,804 --> 04:17:47,900
So for example,

6052
04:17:47,900 --> 04:17:51,400
you have two vertices, V1 and V2,
and there are two edges

6053
04:17:51,400 --> 04:17:53,100
with weights 10 and 15.

6054
04:17:53,100 --> 04:17:56,291
So actually what you can do is
you can merge those two edges

6055
04:17:56,291 --> 04:17:59,700
if they have the same direction, and
you can represent the weight as 25.
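
To make the four structural operators concrete, here is a minimal sketch in Scala. The toy graph, the object name StructuralOperatorsSketch, and the local[*] master are assumptions made for illustration and are not part of the session; the weights 10 and 15 are the ones from the example just described.

```scala
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.sql.SparkSession

object StructuralOperatorsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally so the sketch needs no cluster
    val spark = SparkSession.builder
      .appName("StructuralOperatorsSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical toy graph: two parallel edges V1 -> V2 (weights 10 and 15)
    val vertices = sc.parallelize(Seq((1L, "V1"), (2L, "V2"), (3L, "V3")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 10), Edge(1L, 2L, 15), Edge(2L, 3L, 7)))
    val graph = Graph(vertices, edges)

    // reverse: every edge now points from its old destination to its old source
    val reversed = graph.reverse

    // subgraph: keep vertices that pass vpred and edges that pass epred
    val heavy = graph.subgraph(
      epred = triplet => triplet.attr >= 10,
      vpred = (id, name) => name != "V3")

    // mask: restrict the original graph to what is also present in `heavy`
    val masked = graph.mask(heavy)

    // groupEdges: merge parallel edges by adding their weights.
    // It only merges edges that sit in the same partition, so partitionBy comes first.
    val merged = graph
      .partitionBy(PartitionStrategy.EdgePartition2D)
      .groupEdges((w1, w2) => w1 + w2)

    println("reversed edges:")
    reversed.edges.collect.foreach(println)
    println("masked edges:")
    masked.edges.collect.foreach(println)
    println("merged edges (the two V1 -> V2 edges become one with weight 25):")
    merged.edges.collect.foreach(println)

    spark.stop()
  }
}
```

The partitionBy call before groupEdges is the one detail that is easy to miss: without it, parallel edges that land in different partitions would stay separate.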

6056
04:17:59,700 --> 04:18:02,100
So this will actually
reduce the size

6057
04:18:02,100 --> 04:18:05,144
of the graph. Now, looking
at the next operator,

6058
04:18:05,144 --> 04:18:06,700
which is the join operator.

6059
04:18:06,700 --> 04:18:09,400
So in many cases
it is necessary

6060
04:18:09,400 --> 04:18:13,151
to join data from external
collections with graphs, right?

6061
04:18:13,151 --> 04:18:13,909
For example,

6062
04:18:13,909 --> 04:18:16,100
we might have
an extra user property

6063
04:18:16,100 --> 04:18:18,855
that we want to merge
with the existing graph

6064
04:18:18,855 --> 04:18:21,186
or we might want
to pull vertex properties

6065
04:18:21,186 --> 04:18:23,100
from one graph to another, right?

6066
04:18:23,100 --> 04:18:24,700
So these are some
of the situations

6067
04:18:24,700 --> 04:18:27,000
where you go ahead and use
these join operators.

6068
04:18:27,000 --> 04:18:28,900
So now as you can see over here,

6069
04:18:28,900 --> 04:18:31,100
the first operator
is joinVertices.

6070
04:18:31,100 --> 04:18:34,792
So the joinVertices operator
joins the vertices

6071
04:18:34,792 --> 04:18:36,176
with the input RDD

6072
04:18:36,200 --> 04:18:39,516
and returns a new graph
with the vertex properties

6073
04:18:39,516 --> 04:18:42,700
obtained after applying
the user-defined map function.

6074
04:18:42,700 --> 04:18:45,400
Now, the vertices
without a matching value

6075
04:18:45,400 --> 04:18:49,500
in the RDD basically retain
their original value. Now, talking

6076
04:18:49,500 --> 04:18:51,400
about outerJoinVertices.

6077
04:18:51,400 --> 04:18:55,100
So it behaves similarly
to joinVertices, except that

6078
04:18:55,100 --> 04:18:59,586
the user-defined map function
is applied to all the vertices

6079
04:18:59,586 --> 04:19:02,200
and can change
the vertex property type.

6080
04:19:02,200 --> 04:19:05,600
So suppose that you have
an old graph which has

6081
04:19:05,600 --> 04:19:08,100
a vertex attribute, the old price,

6082
04:19:08,200 --> 04:19:10,700
and then you created
a new graph from it,

6083
04:19:10,700 --> 04:19:13,735
and it has the vertex
attribute as the new price.

6084
04:19:13,735 --> 04:19:16,645
So you can go ahead
and join two of these graphs

6085
04:19:16,645 --> 04:19:19,249
and you can perform
an aggregation of both

6086
04:19:19,249 --> 04:19:21,725
the old and new prices
in the new graph.
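
A small sketch of the old price and new price scenario in Scala may help here. The product graph, the price values, and the object name JoinOperatorsSketch are hypothetical, and the new prices are supplied as a plain RDD rather than as a second graph, which is an assumption made to keep the example short.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object JoinOperatorsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally so the sketch needs no cluster
    val spark = SparkSession.builder
      .appName("JoinOperatorsSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical graph whose vertex attribute is the old price
    val vertices = sc.parallelize(Seq((1L, 100.0), (2L, 250.0), (3L, 80.0)))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val oldGraph: Graph[Double, Int] = Graph(vertices, edges)

    // External collection of new prices; vertex 3 deliberately has no entry
    val newPrices = sc.parallelize(Seq((1L, 120.0), (2L, 240.0)))

    // joinVertices: the attribute type stays Double, and vertex 3,
    // which has no match in the RDD, keeps its original value
    val updated: Graph[Double, Int] =
      oldGraph.joinVertices(newPrices)((id, oldPrice, newPrice) => newPrice)

    // outerJoinVertices: the map function runs for every vertex and may
    // change the attribute type; a missing match arrives as None
    val both: Graph[(Double, Option[Double]), Int] =
      oldGraph.outerJoinVertices(newPrices)(
        (id, oldPrice, newPriceOpt) => (oldPrice, newPriceOpt))

    println("joinVertices:")
    updated.vertices.collect.foreach(println)
    println("outerJoinVertices:")
    both.vertices.collect.foreach(println)

    spark.stop()
  }
}
```

Keeping both prices in a tuple is just one choice; the same outerJoinVertices call could instead compute a difference or an average of the two.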

6087
04:19:21,725 --> 04:19:25,265
So in this kind of situation
the join operators are used.

6088
04:19:25,265 --> 04:19:26,389
Now, moving ahead,

6089
04:19:26,389 --> 04:19:29,814
let's talk about neighborhood
aggregation. Now, a key step

6090
04:19:29,814 --> 04:19:33,239
in many graph analytics
is aggregating the information

6091
04:19:33,239 --> 04:19:36,600
about the neighborhood
of each vertex. For example,

6092
04:19:36,600 --> 04:19:39,500
we might want to know the number
of followers each user has

6093
04:19:39,700 --> 04:19:41,200
or the average age

6094
04:19:41,200 --> 04:19:45,600
of the followers of each user. Now,
many iterative graph algorithms,

6095
04:19:45,600 --> 04:19:47,416
like PageRank, shortest path,

6096
04:19:47,416 --> 04:19:50,501
and connected components,
repeatedly aggregate

6097
04:19:50,501 --> 04:19:52,893
the properties of
neighboring vertices.

6098
04:19:52,893 --> 04:19:56,200
Now, it has four operators
in neighborhood aggregation.

6099
04:19:56,200 --> 04:19:58,803
So the first one is
aggregateMessages.

6100
04:19:58,803 --> 04:20:01,500
So the core aggregation
operation in GraphX

6101
04:20:01,500 --> 04:20:02,900
is aggregateMessages.

6102
04:20:02,900 --> 04:20:04,090
Now this operator

6103
04:20:04,090 --> 04:20:07,100
applies a user-defined
sendMsg function,

6104
04:20:07,100 --> 04:20:10,799
as you can see over here,
to each edge triplet

6105
04:20:10,799 --> 04:20:11,600
in the graph

6106
04:20:11,600 --> 04:20:14,230
and then it uses
the mergeMsg function

6107
04:20:14,230 --> 04:20:17,900
to aggregate those messages
at the destination vertex.

6108
04:20:18,000 --> 04:20:19,900
Now the user-defined

6109
04:20:19,900 --> 04:20:23,150
sendMsg function
takes an EdgeContext,

6110
04:20:23,150 --> 04:20:26,200
as you can see,
which exposes the source

6111
04:20:26,200 --> 04:20:29,892
and destination attributes
along with the edge attribute,

6112
04:20:29,892 --> 04:20:32,399
and functions like sendToSrc
or sendToDst

6113
04:20:32,399 --> 04:20:35,303
are used
to send messages to the source

6114
04:20:35,303 --> 04:20:37,013
and destination vertices.

6115
04:20:37,013 --> 04:20:39,800
Now, you can think
of sendMsg as the map

6116
04:20:39,800 --> 04:20:43,592
function in MapReduce, and
the user-defined mergeMsg function

6117
04:20:43,592 --> 04:20:46,000
actually takes
two messages

6118
04:20:46,000 --> 04:20:48,200
which are present
on the same vertex,

6119
04:20:48,200 --> 04:20:50,784
or you can say
the same destination vertex,

6120
04:20:50,784 --> 04:20:52,090
and it combines

6121
04:20:52,090 --> 04:20:55,662
or aggregates those messages
and produces a single message.

6122
04:20:55,662 --> 04:20:58,146
Now, you can think
of mergeMsg

6123
04:20:58,146 --> 04:21:00,500
as the reduce function
in MapReduce. Now,

6124
04:21:00,500 --> 04:21:05,100
the aggregateMessages operator
returns a VertexRDD.

6125
04:21:05,100 --> 04:21:08,128
Basically, it contains
the aggregated messages at each

6126
04:21:08,128 --> 04:21:09,657
of the destination vertices,

6127
04:21:09,657 --> 04:21:10,600
and vertices

6128
04:21:10,600 --> 04:21:13,815
that did not receive
a message are not included

6129
04:21:13,815 --> 04:21:15,693
in the returned VertexRDD.

6130
04:21:15,693 --> 04:21:17,028
So only those vertices

6131
04:21:17,028 --> 04:21:20,500
are returned which actually
have received the message

6132
04:21:20,500 --> 04:21:22,956
and then those messages
have been merged.

6133
04:21:22,956 --> 04:21:25,250
Any vertex
which hasn't received

6134
04:21:25,250 --> 04:21:28,437
a message will not be included
in the returned RDD,

6135
04:21:28,437 --> 04:21:31,500
or you can say the returned
VertexRDD. Now, in addition,

6136
04:21:31,500 --> 04:21:34,000
as you can see, we have
tripletFields.

6137
04:21:34,000 --> 04:21:37,519
So aggregateMessages takes
an optional tripletFields,

6138
04:21:37,519 --> 04:21:39,400
which indicates what data is

6139
04:21:39,400 --> 04:21:41,304
accessed in the EdgeContext.

6140
04:21:41,304 --> 04:21:42,752
So the possible options

6141
04:21:42,752 --> 04:21:45,900
for tripletFields
are defined in TripletFields.

6142
04:21:45,900 --> 04:21:48,600
The default value
of tripletFields is

6143
04:21:48,600 --> 04:21:52,300
TripletFields.All, as you can see over
here. This basically indicates

6144
04:21:52,300 --> 04:21:55,600
that the user-defined sendMsg
function may access

6145
04:21:55,600 --> 04:21:58,074
any of the fields
in the EdgeContext.

6146
04:21:58,074 --> 04:22:01,982
So this tripletFields argument
can be used to notify GraphX

6147
04:22:01,982 --> 04:22:05,549
that only this part of
the EdgeContext will be needed,

6148
04:22:05,549 --> 04:22:09,491
which basically allows GraphX
to select an optimized join

6149
04:22:09,491 --> 04:22:10,700
strategy. So I hope

6150
04:22:10,700 --> 04:22:13,500
that you guys are clear
with aggregateMessages.
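
As a quick illustration of sendMsg, mergeMsg and tripletFields together, here is a minimal sketch in Scala that counts followers per user. The three users, the follow edges, and the object name FollowerCountSketch are made up for the example; only the aggregateMessages call itself is the operator described above.

```scala
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexRDD}
import org.apache.spark.sql.SparkSession

object FollowerCountSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally so the sketch needs no cluster
    val spark = SparkSession.builder
      .appName("FollowerCountSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical data: an edge points from the follower to the followed user
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(Edge(2L, 1L, 1), Edge(3L, 1L, 1), Edge(1L, 2L, 1)))
    val graph = Graph(users, follows)

    val followerCounts: VertexRDD[Int] = graph.aggregateMessages[Int](
      ctx => ctx.sendToDst(1),  // sendMsg: one message per edge, to the followed user
      (a, b) => a + b,          // mergeMsg: add up the messages arriving at a vertex
      TripletFields.None)       // sendMsg reads no vertex or edge attributes

    // Vertex 3 (carol) has no followers, receives no message,
    // and therefore does not appear in the result
    followerCounts.collect.foreach(println)

    spark.stop()
  }
}
```

Passing TripletFields.None is safe here only because sendMsg never touches srcAttr, dstAttr or attr; as soon as it reads one of them, the matching field set (or the default, TripletFields.All) has to be used instead.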

6151
04:22:13,500 --> 04:22:16,794
Let's quickly move ahead
and look at the second operator.

6152
04:22:16,794 --> 04:22:20,019
So the second operator is
the mapReduceTriplets transition.

6153
04:22:20,019 --> 04:22:21,400
Now, in earlier versions

6154
04:22:21,400 --> 04:22:24,700
of GraphX, neighborhood
aggregation was accomplished

6155
04:22:24,700 --> 04:22:27,272
using the
mapReduceTriplets operator.

6156
04:22:27,272 --> 04:22:29,802
This mapReduceTriplets
operator is used

6157
04:22:29,802 --> 04:22:31,814
in older versions of GraphX.

6158
04:22:31,814 --> 04:22:35,100
This operator takes
the user-defined map function,

6159
04:22:35,100 --> 04:22:38,900
which is applied to each triplet
and can yield messages

6160
04:22:38,900 --> 04:22:42,300
which are aggregated using the
user-defined reduce function.

6161
04:22:42,300 --> 04:22:44,300
This one is the
user-defined map function.

6162
04:22:44,300 --> 04:22:46,600
And this one is your
user-defined reduce function.

6163
04:22:46,600 --> 04:22:49,081
So it basically applies
the map function

6164
04:22:49,081 --> 04:22:50,305
to all the triplets

6165
04:22:50,305 --> 04:22:53,654
and then it aggregates
those messages using this

6166
04:22:53,654 --> 04:22:55,171
user-defined reduce function.

6167
04:22:55,171 --> 04:22:58,900
Now, the newer version of this
mapReduceTriplets operator

6168
04:22:58,900 --> 04:23:01,770
is aggregateMessages.
Now, moving ahead,

6169
04:23:01,770 --> 04:23:04,900
let's talk about the operators
for computing degree information.

6170
04:23:04,900 --> 04:23:07,900
So one of the common
aggregation tasks is computing

6171
04:23:07,900 --> 04:23:09,579
the degree of each vertex.

6172
04:23:09,579 --> 04:23:12,842
That is the number of edges
adjacent to each vertex.

6173
04:23:12,842 --> 04:23:15,072
Now, in the context
of directed graphs,

6174
04:23:15,072 --> 04:23:18,400
it is often necessary to know
the in-degree, the out-degree,

6175
04:23:18,400 --> 04:23:20,300
and the total degree of each vertex.

6176
04:23:20,300 --> 04:23:22,800
These kinds of things are
pretty much important

6177
04:23:22,800 --> 04:23:25,389
and the GraphOps class
contains a collection

6178
04:23:25,389 --> 04:23:28,400
of operators to compute
the degrees of each vertex.
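
Here is a minimal sketch in Scala of how these degree operators can be used to find the vertex with the largest in-degree, out-degree and total degree. The toy edges and the object name DegreeSketch are assumptions; the small max helper is written out here for illustration and is not a GraphX function.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.sql.SparkSession

object DegreeSketch {
  // Helper (not part of GraphX): keep the (vertexId, degree) pair with the larger degree
  def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) =
    if (a._2 > b._2) a else b

  def main(args: Array[String]): Unit = {
    // Assumption: run locally so the sketch needs no cluster
    val spark = SparkSession.builder
      .appName("DegreeSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical follower graph: an edge points from follower to followed user
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(Edge(2L, 1L, 1), Edge(3L, 1L, 1), Edge(1L, 2L, 1)))
    val graph = Graph(users, follows)

    // inDegrees, outDegrees and degrees each return a VertexRDD[Int]
    val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)
    val maxOutDegree: (VertexId, Int) = graph.outDegrees.reduce(max)
    val maxDegrees: (VertexId, Int) = graph.degrees.reduce(max)

    println(s"max in-degree:    $maxInDegree")
    println(s"max out-degree:   $maxOutDegree")
    println(s"max total degree: $maxDegrees")

    spark.stop()
  }
}
```

Each result is a pair of vertex ID and degree, so the output tells you both which vertex is the most connected and how many edges it has.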

6179
04:23:28,500 --> 04:23:29,800
So as you can see,

6180
04:23:29,800 --> 04:23:33,100
we have maxInDegree,
then maxOutDegree,

6181
04:23:33,100 --> 04:23:36,100
then maxDegrees.
maxInDegree will tell

6182
04:23:36,100 --> 04:23:39,400
us the maximum number of
incoming edges, then

6183
04:23:39,400 --> 04:23:42,325
maxOutDegree will tell us the
maximum number of outgoing edges,

6184
04:23:42,325 --> 04:23:43,510
and this maxDegrees

6185
04:23:43,510 --> 04:23:46,685
will actually tell us the maximum
number of incoming as well as

6186
04:23:46,685 --> 04:23:49,572
outgoing edges. Now, moving
ahead to the next operator,

6187
04:23:49,572 --> 04:23:52,300
that is, collecting
neighbors. In some cases

6188
04:23:52,300 --> 04:23:54,182
it may be easier to express

6189
04:23:54,182 --> 04:23:57,600
the computation by collecting
neighboring vertices

6190
04:23:57,600 --> 04:24:00,000
and their attributes
at each vertex.

6191
04:24:00,000 --> 04:24:02,624
Now, this can be easily
accomplished using

6192
04:24:02,624 --> 04:24:06,400
the collectNeighborIds and
the collectNeighbors operators.

6193
04:24:06,400 --> 04:24:09,600
So basically
collectNeighborIds takes

6194
04:24:09,600 --> 04:24:12,200
the EdgeDirection
as the parameter,

6195
04:24:12,300 --> 04:24:14,400
and it returns a VertexRDD

6196
04:24:14,400 --> 04:24:17,400
that contains the array
of vertex IDs

6197
04:24:17,500 --> 04:24:20,000
that are neighboring
to the particular vertex.

6198
04:24:20,000 --> 04:24:23,400
Now, similarly, collectNeighbors
again takes

6199
04:24:23,400 --> 04:24:25,717
the EdgeDirection as the input,

6200
04:24:25,717 --> 04:24:28,000
and it will return you the array

6201
04:24:28,000 --> 04:24:31,600
with the vertex ID and
the vertex attribute both. Now,

6202
04:24:31,600 --> 04:24:32,717
let me quickly open

6203
04:24:32,717 --> 04:24:35,700
my VM, and let us go through
the Spark directory first.

6204
04:24:35,900 --> 04:24:38,600
Let me first open
my terminal. So first

6205
04:24:38,600 --> 04:24:41,800
I'll start the Hadoop daemons. So
for that I will go

6206
04:24:41,800 --> 04:24:46,358
to the Hadoop home directory,
then inside sbin, run the

6207
04:24:46,358 --> 04:24:48,282
start-all.sh script file.

6208
04:24:52,000 --> 04:24:53,400
So let me check

6209
04:24:53,400 --> 04:24:55,700
if the Hadoop daemons
are running or not.

6210
04:24:58,700 --> 04:25:00,706
So as you can see, the NameNode,

6211
04:25:00,706 --> 04:25:03,000
DataNode,
SecondaryNameNode,

6212
04:25:03,000 --> 04:25:05,848
the NodeManager
and ResourceManager:

6213
04:25:05,848 --> 04:25:08,400
all the daemons
of Hadoop are up. Now,

6214
04:25:08,400 --> 04:25:10,661
I will navigate to the Spark home.

6215
04:25:10,661 --> 04:25:13,300
Let me first start
the Spark daemons.

6216
04:25:17,600 --> 04:25:19,700
I see the Spark daemons are running.

6217
04:25:19,700 --> 04:25:24,000
I'll first minimize this and let
me take you to the Spark home.

6218
04:25:24,900 --> 04:25:27,309
And this is my Spark directory.

6219
04:25:27,309 --> 04:25:28,712
I'll go inside. Now,

6220
04:25:28,712 --> 04:25:30,926
let me first show you the data

6221
04:25:30,926 --> 04:25:34,100
which is by default present
with your Spark.

6222
04:25:34,400 --> 04:25:36,700
So we'll open this in a new tab.

6223
04:25:36,700 --> 04:25:38,865
So you can see
we have two files

6224
04:25:38,865 --> 04:25:41,100
in this GraphX data directory.

6225
04:25:41,100 --> 04:25:44,638
Meanwhile, let me take you
to the example code.

6226
04:25:44,638 --> 04:25:48,900
So this is examples,
and inside src, main, scala,

6227
04:25:49,600 --> 04:25:50,500
you can find

6228
04:25:50,500 --> 04:25:54,700
the graphx directory, and
inside this graphx directory

6229
04:25:54,700 --> 04:25:59,000
you have some of the sample code
which is present over here.

6230
04:25:59,000 --> 04:26:01,692
So I will take you
to this

6231
04:26:01,692 --> 04:26:05,100
AggregateMessagesExample.scala.
Now, meanwhile,

6232
04:26:05,100 --> 04:26:07,287
let me open the data as well.

6233
04:26:07,287 --> 04:26:09,700
So you'll be able to understand.

6234
04:26:10,500 --> 04:26:12,967
Now, this is the
followers.txt file.

6235
04:26:12,967 --> 04:26:15,000
So basically you can imagine

6236
04:26:15,000 --> 04:26:18,545
these are the edges, each
given as a pair of vertices.

6237
04:26:18,545 --> 04:26:21,580
So this is vertex 2
and this is vertex 1, then

6238
04:26:21,580 --> 04:26:25,100
this is vertex 4 and this
is vertex 1, and similarly

6239
04:26:25,100 --> 04:26:28,400
so on. These are representing
those vertices, and

6240
04:26:28,400 --> 04:26:30,900
if you can remember I
have already told you

6241
04:26:30,900 --> 04:26:33,200
that inside the GraphLoader class

6242
04:26:33,200 --> 04:26:35,818
there is a function
called edgeListFile

6243
04:26:35,818 --> 04:26:37,200
which takes the edges

6244
04:26:37,200 --> 04:26:40,500
from a file and then
constructs the graph based

6245
04:26:40,500 --> 04:26:43,800
on that. Now, second, you
have this users.txt.

6246
04:26:43,800 --> 04:26:47,550
So these are basically the vertices
with the vertex ID.

6247
04:26:47,550 --> 04:26:51,200
So vertex ID for this vertex
is 1 then for this is 2

6248
04:26:51,200 --> 04:26:53,539
and so on and then
this is the data

6249
04:26:53,539 --> 04:26:57,600
which is attached, or you can say
the attribute of the vertices.

6250
04:26:57,600 --> 04:26:59,800
So these are the vertex IDs,

6251
04:26:59,958 --> 04:27:03,700
which is 1 2 3 respectively
and this is the data

6252
04:27:03,700 --> 04:27:06,800
which is associated
with each of your vertices.

6253
04:27:06,800 --> 04:27:10,500
So this is the username and this
might be the name of your user,

6254
04:27:10,500 --> 04:27:13,100
and so on. Now,
you can also see

6255
04:27:13,100 --> 04:27:16,900
that in some of the cases
the name of the user is missing.

6256
04:27:16,900 --> 04:27:18,800
So as in this case, the name

6257
04:27:18,800 --> 04:27:22,100
of the user is missing.
These are the vertices,

6258
04:27:22,100 --> 04:27:26,300
or you can say the vertex IDs
and vertex attributes.

6259
04:27:26,600 --> 04:27:30,500
Now, let me take you through
this AggregateMessagesExample.

6260
04:27:30,600 --> 04:27:32,400
So as you can see,
we are giving the name

6261
04:27:32,400 --> 04:27:36,100
of the package,
org.apache.spark.examples.graphx.

6262
04:27:36,300 --> 04:27:40,306
Then we are importing GraphX;
in that we are importing the

6263
04:27:40,306 --> 04:27:41,764
Graph class as well as

6264
04:27:41,764 --> 04:27:45,700
this VertexRDD. Next, we
are using this GraphGenerators.

6265
04:27:45,700 --> 04:27:48,500
I'll tell you why we
are using this GraphGenerators,

6266
04:27:48,700 --> 04:27:52,400
and then we are using
the SparkSession over here.

6267
04:27:52,400 --> 04:27:54,105
So this is an example

6268
04:27:54,163 --> 04:27:58,778
where we are using the
aggregateMessages operator to compute

6269
04:27:58,778 --> 04:28:03,163
the average age of the more
senior followers of each user.

6270
04:28:03,200 --> 04:28:03,700
Okay.

6271
04:28:03,928 --> 04:28:06,929
So this is the object,
AggregateMessagesExample.

6272
04:28:07,000 --> 04:28:10,000
Now, this is the main function
where we are first

6273
04:28:10,100 --> 04:28:13,600
initializing the SparkSession, then
the name of the application.

6274
04:28:13,600 --> 04:28:16,400
So you have to provide the name
of the application

6275
04:28:16,400 --> 04:28:17,400
and this is the

6276
04:28:17,400 --> 04:28:20,600
getOrCreate method. Now,
next you are initializing

6277
04:28:20,600 --> 04:28:24,338
the SparkContext as sc.
Now, coming to the code.

6278
04:28:24,400 --> 04:28:27,400
So we are specifying
a graph, and this graph

6279
04:28:27,400 --> 04:28:30,300
is containing Double and Int. Now,

6280
04:28:30,400 --> 04:28:33,200
I just told you that we
are importing GraphGenerators.

6281
04:28:33,200 --> 04:28:35,023
So GraphGenerators is used

6282
04:28:35,023 --> 04:28:37,900
to generate a random
graph for simplicity.

6283
04:28:37,900 --> 04:28:40,400
So you would have multiple
edges and vertices.

6284
04:28:40,400 --> 04:28:43,047
Then you are using
this logNormalGraph.

6285
04:28:43,047 --> 04:28:44,900
You're passing the SparkContext

6286
04:28:44,900 --> 04:28:47,677
and you're specifying the number
of vertices as 100.

6287
04:28:47,677 --> 04:28:49,956
So it will generate
100 vertices for you.

6288
04:28:49,956 --> 04:28:51,200
Then what you are doing.

6289
04:28:51,200 --> 04:28:53,400
You are specifying
mapVertices,

6290
04:28:53,400 --> 04:28:56,815
and you're trying
to map the ID to a Double. So

6291
04:28:56,815 --> 04:28:58,200
what this would do:

6292
04:28:58,200 --> 04:29:02,100
this will basically map
your ID to a Double. Then,

6293
04:29:02,100 --> 04:29:05,700
next, you're trying
to calculate olderFollowers,

6294
04:29:05,700 --> 04:29:08,300
where you have given
it as a VertexRDD,

6295
04:29:08,300 --> 04:29:10,494
and the tuple is (Int, Double). Also,

6296
04:29:10,494 --> 04:29:13,900
your vertices already
have an ID, the vertex ID,

6297
04:29:13,900 --> 04:29:15,200
and your data is a Double,

6298
04:29:15,200 --> 04:29:17,533
which is associated
with each of the vertices,

6299
04:29:17,533 --> 04:29:19,604
or you can say
the vertex attribute.

6300
04:29:19,604 --> 04:29:20,900
So you have this graph

6301
04:29:20,900 --> 04:29:23,178
which is basically
generated randomly

6302
04:29:23,178 --> 04:29:26,189
and then you are performing
aggregateMessages.

6303
04:29:26,189 --> 04:29:29,200
So this is the
aggregateMessages operator. Now,

6304
04:29:29,200 --> 04:29:33,353
if you can remember, we first
have sendMsg, right?

6305
04:29:33,353 --> 04:29:35,000
So inside this triplet,

6306
04:29:35,000 --> 04:29:38,620
we are specifying a function
that checks if the source attribute

6307
04:29:38,620 --> 04:29:40,100
of the triplet is greater than the

6308
04:29:40,100 --> 04:29:42,300
destination attribute
of the triplet.

6309
04:29:42,300 --> 04:29:43,900
So basically it will check

6310
04:29:43,900 --> 04:29:47,144
whether the follower's age
is greater than the age

6311
04:29:47,144 --> 04:29:48,452
of the person whom he

6312
04:29:48,452 --> 04:29:52,259
is following. This tells us whether
the follower's age is greater

6313
04:29:52,259 --> 04:29:55,000
than the age of the person
he is following.

6314
04:29:55,000 --> 04:29:56,462
So in that situation,

6315
04:29:56,462 --> 04:29:59,200
it will send a message
to the destination

6316
04:29:59,200 --> 04:30:01,400
vertex containing a counter,

6317
04:30:01,400 --> 04:30:05,000
that is 1, and the age
of the source attribute,

6318
04:30:05,000 --> 04:30:07,700
that is, the age
of the follower. So first,

6319
04:30:07,700 --> 04:30:10,800
you can see whether the age
of the destination is less

6320
04:30:10,800 --> 04:30:12,807
than the age
of the source attribute.

6321
04:30:12,807 --> 04:30:14,000
So it will tell you

6322
04:30:14,000 --> 04:30:17,293
if the follower is older
than the user or not.

6323
04:30:17,293 --> 04:30:21,100
So in that situation we'll send
1 to the destination

6324
04:30:21,100 --> 04:30:23,900
and we'll send the age
of the source

6325
04:30:23,900 --> 04:30:26,900
or you can say the age
of the follower. Then second,

6326
04:30:26,900 --> 04:30:29,400
I have told you
that we have mergeMsg.

6327
04:30:29,500 --> 04:30:32,500
So here we are adding
the counter and the age

6328
04:30:32,600 --> 04:30:33,800
in this reduce function.

6329
04:30:33,900 --> 04:30:37,515
So now what we are doing is we
are dividing the total age

6330
04:30:37,515 --> 04:30:38,421
by the number

6331
04:30:38,421 --> 04:30:41,439
of older followers
to get the average age

6332
04:30:41,439 --> 04:30:42,700
of older followers.

6333
04:30:42,700 --> 04:30:45,400
So this is the reason why
we have passed the attribute

6334
04:30:45,400 --> 04:30:47,200
of the source vertex. Firstly,

6335
04:30:47,200 --> 04:30:49,300
we are specifying
this variable, that is,

6336
04:30:49,300 --> 04:30:51,194
avgAgeOfOlderFollowers.

6337
04:30:51,194 --> 04:30:53,700
And then we are specifying
the VertexRDD.

6338
04:30:53,888 --> 04:30:58,211
So this will be Double,
and then this olderFollowers,

6339
04:30:58,292 --> 04:30:59,600
that is the VertexRDD

6340
04:30:59,600 --> 04:31:02,349
which we are picking up
from here and then we

6341
04:31:02,349 --> 04:31:04,100
are trying to map the value.

6342
04:31:04,100 --> 04:31:05,400
So in the vertex,

6343
04:31:05,400 --> 04:31:10,100
we have the ID and we have the value. So
in this situation we

6344
04:31:10,100 --> 04:31:13,600
are using this case pattern
with count and totalAge.

6345
04:31:13,600 --> 04:31:16,000
So what we are doing we
are taking this total age

6346
04:31:16,000 --> 04:31:19,246
and we are dividing it by count
which we have gathered from this

6347
04:31:19,246 --> 04:31:20,011
sendMsg.

6348
04:31:20,011 --> 04:31:22,800
And then we have aggregated
using this reduce function.

6349
04:31:22,800 --> 04:31:26,400
We are again taking the total
age of the older followers.

6350
04:31:26,400 --> 04:31:28,994
And then we are trying
to divide it by count

6351
04:31:28,994 --> 04:31:30,377
to get the average age

6352
04:31:30,377 --> 04:31:33,900
Then at last we are trying
to display the result, and then

6353
04:31:33,900 --> 04:31:35,600
we are stopping Spark.
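
Putting the walkthrough together, the program can be sketched in Scala roughly as follows. This is a reconstruction from the explanation above, not a verbatim copy of the file on screen: the object name OlderFollowersSketch, the application name, and the local[*] master are assumptions added so the sketch runs on its own.

```scala
import org.apache.spark.graphx.{Graph, VertexRDD}
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.sql.SparkSession

object OlderFollowersSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local[*] so the sketch runs without spark-submit setting a master
    val spark = SparkSession.builder
      .appName("OlderFollowersSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Random graph with 100 vertices; the vertex ID is reused as the "age" (a Double)
    val graph: Graph[Double, Int] =
      GraphGenerators.logNormalGraph(sc, numVertices = 100)
        .mapVertices((id, _) => id.toDouble)

    // For every user, count the older followers and add up their ages
    val olderFollowers: VertexRDD[(Int, Double)] =
      graph.aggregateMessages[(Int, Double)](
        triplet => {
          // sendMsg: only a follower older than the followed user sends (1, age)
          if (triplet.srcAttr > triplet.dstAttr) {
            triplet.sendToDst((1, triplet.srcAttr))
          }
        },
        // mergeMsg: add the counters and add the ages
        (a, b) => (a._1 + b._1, a._2 + b._2))

    // Divide the total age by the count to get the average age of older followers
    val avgAgeOfOlderFollowers: VertexRDD[Double] =
      olderFollowers.mapValues((id, value) =>
        value match { case (count, totalAge) => totalAge / count })

    // Display the result as (vertexId, averageAge) pairs
    avgAgeOfOlderFollowers.collect.foreach(println(_))

    spark.stop()
  }
}
```

Because the graph is generated randomly, the printed pairs will differ from run to run; users with no older followers do not appear at all.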

6354
04:31:35,600 --> 04:31:38,385
So let me quickly open
the terminal so I

6355
04:31:38,385 --> 04:31:39,742
will go to examples

6356
04:31:39,742 --> 04:31:43,600
So inside examples, I took you
through the src directory

6357
04:31:43,600 --> 04:31:46,400
where the code is
present inside scala.

6358
04:31:46,400 --> 04:31:49,154
And then inside there
is a spark directory

6359
04:31:49,154 --> 04:31:51,975
where you will find
the code but to execute

6360
04:31:51,975 --> 04:31:55,200
the example you need to go
to the jars directory.

6361
04:31:56,100 --> 04:31:58,392
Now, this is
the Spark examples jar

6362
04:31:58,392 --> 04:32:00,200
which you need to execute.

6363
04:32:00,200 --> 04:32:03,100
But before this,
let me take you to HDFS.

6364
04:32:03,400 --> 04:32:05,600
So the URL is

6365
04:32:05,600 --> 04:32:07,400
localhost:50070.

6366
04:32:08,500 --> 04:32:10,800
And we'll go to utilities then

6367
04:32:10,800 --> 04:32:12,800
we'll go to browse
the file system.

6368
04:32:13,000 --> 04:32:14,137
So as you can see,

6369
04:32:14,137 --> 04:32:16,849
I have created a user
directory in which I

6370
04:32:16,849 --> 04:32:18,700
have specified the username.

6371
04:32:18,700 --> 04:32:22,000
That is edureka,
and inside edureka

6372
04:32:22,000 --> 04:32:24,200
I have placed my data directory

6373
04:32:24,200 --> 04:32:27,500
where we have this graphx directory,
and inside graphx

6374
04:32:27,500 --> 04:32:30,100
we have both the files,
that is followers.txt

6375
04:32:30,100 --> 04:32:31,600
and users.txt.

6376
04:32:31,600 --> 04:32:32,854
So in this program,

6377
04:32:32,854 --> 04:32:35,100
we are not referring
to these files

6378
04:32:35,100 --> 04:32:38,500
but the upcoming examples will
be referring to these files.

6379
04:32:38,500 --> 04:32:42,700
So I would request you to first
move it to this hdfs directory.

6380
04:32:42,700 --> 04:32:46,800
so that Spark can refer to
the files in data/graphx.

6381
04:32:47,000 --> 04:32:50,300
Now, let me quickly minimize
this and the command

6382
04:32:50,300 --> 04:32:53,000
to execute is spark-submit,

6383
04:32:53,000 --> 04:32:56,900
and then I'll pass
this jars parameter

6384
04:32:56,900 --> 04:32:59,900
and I'll provide
the Spark examples jar.

6385
04:33:01,200 --> 04:33:05,100
So this is the jar then
I'll specify the class name.

6386
04:33:05,100 --> 04:33:06,900
So to get the class name.

6387
04:33:06,900 --> 04:33:08,900
I will go to the code.

6388
04:33:09,200 --> 04:33:12,000
I'll first take
the package name from here.

6389
04:33:12,700 --> 04:33:14,100
And then I'll take

6390
04:33:14,100 --> 04:33:17,935
the class name, which is
AggregateMessagesExample,

6391
04:33:17,935 --> 04:33:19,400
so this is my class.

6392
04:33:19,400 --> 04:33:21,928
And as I told you, you have
to provide the name

6393
04:33:21,928 --> 04:33:23,100
of the application.

6394
04:33:23,100 --> 04:33:26,600
So let me keep it as example
and I'll hit enter.

6395
04:33:31,946 --> 04:33:34,253
So now you can see the result.

6396
04:33:36,000 --> 04:33:37,700
So this is the followers

6397
04:33:37,700 --> 04:33:40,500
and this is the average
age of followers.

6398
04:33:40,500 --> 04:33:41,827
So it is 34. Then

6399
04:33:41,827 --> 04:33:45,038
we have 52, which is
the count of followers.

6400
04:33:45,038 --> 04:33:48,500
And the average age is
seventy six point eight

6401
04:33:48,500 --> 04:33:51,100
that is it has
96 senior followers.

6402
04:33:51,100 --> 04:33:52,900
And then the average age

6403
04:33:52,900 --> 04:33:56,000
of the followers is
ninety nine point zero,

6404
04:33:56,100 --> 04:33:58,600
then it has
four senior followers

6405
04:33:58,600 --> 04:34:00,520
and the average age is 51.

6406
04:34:00,520 --> 04:34:03,400
Then this vertex has
16 senior followers

6407
04:34:03,400 --> 04:34:06,003
with the average age
of 57 point five.

6408
04:34:06,003 --> 04:34:09,024
and so on. You can see
the result over here.

6409
04:34:09,024 --> 04:34:12,800
So I hope now you guys are clear
with aggregate messages

6410
04:34:12,800 --> 04:34:14,748
how to use aggregate messages

6411
04:34:14,748 --> 04:34:17,100
how to specify
the send message then

6412
04:34:17,100 --> 04:34:19,200
how to write the merge message.
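To recap the three pieces just described (send message, merge message, then dividing the total age by the count), here is a minimal sketch. The toy ages and follow edges are hypothetical and are not the graph used in the session; `OlderFollowersSketch` is an illustrative name, and the app name "example" simply mirrors the one typed on the command line.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}
import org.apache.spark.sql.SparkSession

object OlderFollowersSketch {
  // Vertex attribute = age; an edge src -> dst means "src follows dst".
  def averageAgeOfOlderFollowers(graph: Graph[Double, Int]): VertexRDD[Double] = {
    val olderFollowers: VertexRDD[(Int, Double)] =
      graph.aggregateMessages[(Int, Double)](
        // Send message: only a follower older than the followed user sends (1, age).
        triplet => {
          if (triplet.srcAttr > triplet.dstAttr) {
            triplet.sendToDst((1, triplet.srcAttr))
          }
        },
        // Merge message: add up the counts and the ages.
        (a, b) => (a._1 + b._1, a._2 + b._2)
      )
    // Divide the total age by the count to get the average age.
    olderFollowers.mapValues((id, value) =>
      value match { case (count, totalAge) => totalAge / count })
  }

  def main(args: Array[String]): Unit = {
    // No master is set here; the sketch assumes it is launched with spark-submit.
    val spark = SparkSession.builder.appName("example").getOrCreate()
    val sc = spark.sparkContext
    // Hypothetical toy data: (vertex ID, age) and follower edges.
    val ages = sc.parallelize(Seq((1L, 30.0), (2L, 50.0), (3L, 70.0)))
    val follows = sc.parallelize(Seq(Edge(2L, 1L, 1), Edge(3L, 1L, 1), Edge(3L, 2L, 1)))
    averageAgeOfOlderFollowers(Graph(ages, follows)).collect.foreach(println)
    spark.stop()
  }
}
```

A vertex that receives no message (here vertex 3, who has no older follower) does not appear in the result at all.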

6413
04:34:19,200 --> 04:34:21,788
So let's quickly go back
to the presentation.

6414
04:34:21,788 --> 04:34:23,500
Now, let us quickly move ahead

6415
04:34:23,500 --> 04:34:26,014
and look at some
of the graph algorithms.

6416
04:34:26,014 --> 04:34:27,959
So the first one is PageRank.

6417
04:34:27,959 --> 04:34:31,200
So PageRank measures
the importance of each vertex

6418
04:34:31,200 --> 04:34:32,706
in a graph assuming

6419
04:34:32,800 --> 04:34:35,900
that an edge from u
to v represents

6420
04:34:36,000 --> 04:34:37,453
a recommendation

6421
04:34:37,453 --> 04:34:41,300
or support of v's importance
by u. For example,

6422
04:34:41,300 --> 04:34:45,468
let's say if a Twitter user
is followed by many others, that user

6423
04:34:45,468 --> 04:34:48,200
will obviously rank
high. GraphX comes

6424
04:34:48,200 --> 04:34:51,919
with static and dynamic
implementations of PageRank as

6425
04:34:51,919 --> 04:34:53,780
methods on the PageRank object.

6426
04:34:53,780 --> 04:34:57,500
Static PageRank runs
a fixed number of iterations,

6427
04:34:57,500 --> 04:35:02,200
which can be specified by you,
while the dynamic PageRank runs

6428
04:35:02,200 --> 04:35:04,100
until the ranks converge

6429
04:35:04,500 --> 04:35:08,300
what we mean by that is
they stop changing by more

6430
04:35:08,300 --> 04:35:10,400
than a specified tolerance.

6431
04:35:10,500 --> 04:35:11,300
So it runs

6432
04:35:11,300 --> 04:35:14,500
until it has optimized
the PageRank of each

6433
04:35:14,500 --> 04:35:19,400
of the vertices. Now the GraphOps class
allows calling these algorithms

6434
04:35:19,400 --> 04:35:22,100
directly as methods
on the Graph class.

6435
04:35:22,200 --> 04:35:24,800
Now, let's quickly go
back to the VM.

6436
04:35:25,000 --> 04:35:27,469
So this is the PageRank example.

6437
04:35:27,469 --> 04:35:29,161
Let me open this file.

6438
04:35:29,600 --> 04:35:32,595
So first we are specifying
this graphx package,

6439
04:35:32,595 --> 04:35:35,065
then we are importing
the GraphLoader.

6440
04:35:35,065 --> 04:35:37,600
So as you can remember,
inside this GraphLoader

6441
04:35:37,600 --> 04:35:41,000
class we have
that edgeListFile operator,

6442
04:35:41,000 --> 04:35:43,600
which will basically create
the graph using the edges

6443
04:35:43,600 --> 04:35:46,575
and we have those edges
in our followers.txt

6444
04:35:46,575 --> 04:35:50,542
file. Now coming back
to the PageRank example,

6445
04:35:50,542 --> 04:35:53,900
we're importing the Spark
SQL SparkSession.

6446
04:35:54,100 --> 04:35:56,619
Now, this is the
PageRankExample object,

6447
04:35:56,619 --> 04:35:59,700
inside which we
have created a main function,

6448
04:35:59,700 --> 04:36:04,000
and we have similarly created
the SparkSession, then the builder,

6449
04:36:04,000 --> 04:36:05,600
and we're specifying
the app name

6450
04:36:05,600 --> 04:36:09,800
which is to be provided. Then
we have the getOrCreate method.

6451
04:36:09,800 --> 04:36:10,415
So this is

6452
04:36:10,415 --> 04:36:12,800
where we are initializing
the spark context

6453
04:36:12,800 --> 04:36:13,800
as you can remember.

6454
04:36:13,800 --> 04:36:16,900
I told you that using
this edgeListFile method

6455
04:36:16,900 --> 04:36:19,115
we are basically
creating the graph

6456
04:36:19,115 --> 04:36:21,200
from the followers dot txt file.

6457
04:36:21,200 --> 04:36:24,223
Now, we are running
PageRank over here.

6458
04:36:24,223 --> 04:36:28,421
So in ranks it will give you all
the PageRank values of the vertices

6459
04:36:28,421 --> 04:36:30,104
that are inside this graph,

6460
04:36:30,104 --> 04:36:33,400
which we have just
loaded using the GraphLoader class.

6461
04:36:33,400 --> 04:36:36,575
So if you're passing
an integer as an argument

6462
04:36:36,575 --> 04:36:37,700
to the static PageRank,

6463
04:36:37,700 --> 04:36:40,018
it will run
that number of iterations.

6464
04:36:40,018 --> 04:36:43,000
Otherwise, if you're
passing a double value,

6465
04:36:43,000 --> 04:36:45,495
it will run
until the convergence.

6466
04:36:45,495 --> 04:36:48,400
So we are running
PageRank on this graph

6467
04:36:48,400 --> 04:36:50,861
and we are taking the vertices.

6468
04:36:50,900 --> 04:36:55,300
Now after this we are trying
to load the users.txt file

6469
04:36:55,500 --> 04:36:58,400
and then we are trying to split

6470
04:36:58,400 --> 04:37:02,400
the line by comma, then converting
field zero to Long,

6471
04:37:02,400 --> 04:37:04,571
and we are storing
field one.

6472
04:37:04,571 --> 04:37:06,200
So basically field zero

6473
04:37:06,300 --> 04:37:09,376
in your users.txt is
your vertex ID, or you

6474
04:37:09,376 --> 04:37:13,790
can say the ID of the user,
and field one is your username.

6475
04:37:13,790 --> 04:37:17,252
So we are trying to load
these two Fields now.

6476
04:37:17,280 --> 04:37:19,819
We are trying
to rank by username.

6477
04:37:19,969 --> 04:37:24,600
So we are taking the users
and we are joining the ranks.

6478
04:37:24,600 --> 04:37:28,000
So this is where we
are using the join operation.

6479
04:37:28,000 --> 04:37:29,670
So in ranksByUsername

6480
04:37:29,670 --> 04:37:32,562
We are trying to
attach those username

6481
04:37:32,562 --> 04:37:35,793
or put those username
with the page rank value.

6482
04:37:35,793 --> 04:37:37,641
So we are taking the users

6483
04:37:37,641 --> 04:37:40,554
then we are joining
the ranks, which again

6484
04:37:40,554 --> 04:37:42,900
we are getting
from this PageRank,

6485
04:37:43,300 --> 04:37:47,700
and then we are mapping
the ID user name and rank.
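The walkthrough above can be sketched roughly as follows. This is a hedged reconstruction from the narration, not the exact example file: the relative paths `data/graphx/followers.txt` and `data/graphx/users.txt` and the tolerance `0.0001` are assumptions, and `PageRankSketch` is an illustrative name. In the GraphX API as I understand it, `pageRank(tol)` runs until convergence, while the fixed-iteration variant is exposed as `staticPageRank(numIter)`.

```scala
import org.apache.spark.graphx.{Graph, GraphLoader, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PageRankSketch {
  // Runs dynamic PageRank (until the ranks change by less than tol)
  // and attaches the username to each rank.
  def ranksByUsername(graph: Graph[Int, Int],
                      users: RDD[(VertexId, String)],
                      tol: Double): RDD[(String, Double)] = {
    val ranks = graph.pageRank(tol).vertices
    users.join(ranks).map { case (id, (username, rank)) => (username, rank) }
  }

  def main(args: Array[String]): Unit = {
    // No master is set here; the sketch assumes it is launched with spark-submit.
    val spark = SparkSession.builder.appName("example").getOrCreate()
    val sc = spark.sparkContext
    // Assumed paths: adjust to wherever followers.txt and users.txt live.
    val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
    val users = sc.textFile("data/graphx/users.txt").map { line =>
      val fields = line.split(",")
      (fields(0).toLong, fields(1)) // field 0 = vertex ID, field 1 = username
    }
    println(ranksByUsername(graph, users, 0.0001).collect().mkString("\n"))
    spark.stop()
  }
}
```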

6486
04:37:56,500 --> 04:38:00,517
So it will take some time to run
some iterations over the graph

6487
04:38:00,517 --> 04:38:02,600
and will try to converge it.

6488
04:38:08,000 --> 04:38:11,700
So after converging you
can see the user and the rank.

6489
04:38:11,700 --> 04:38:14,300
So the maximum rank is
with Barack Obama,

6490
04:38:14,300 --> 04:38:18,000
which is 1.45 then
with Lady Gaga.

6491
04:38:18,100 --> 04:38:22,200
it's 1.39, and then with
odersky, and so on.

6492
04:38:22,261 --> 04:38:24,338
Let's go back to the slide.

6493
04:38:25,200 --> 04:38:27,000
So now after page rank,

6494
04:38:27,200 --> 04:38:28,856
let's quickly move ahead

6495
04:38:28,856 --> 04:38:32,200
to connected components.
The connected components

6496
04:38:32,200 --> 04:38:34,923
algorithm labels each
connected component

6497
04:38:34,923 --> 04:38:38,600
of the graph with the ID
of its lowest numbered vertex.

6498
04:38:38,600 --> 04:38:40,700
So let us quickly go
back to the VM.

6499
04:38:42,000 --> 04:38:45,200
Now let's go inside
the graphx directory

6500
04:38:45,200 --> 04:38:48,300
and now we'll open
this ConnectedComponentsExample.

6501
04:38:48,400 --> 04:38:51,818
So again, it's the same: we
import GraphLoader

6502
04:38:51,818 --> 04:38:53,100
and SparkSession.

6503
04:38:53,300 --> 04:38:56,600
Now, this is the
ConnectedComponentsExample object. Next,

6504
04:38:56,600 --> 04:39:00,176
this is the main function,
and inside the main function

6505
04:39:00,176 --> 04:39:01,800
we are again specifying all

6506
04:39:01,800 --> 04:39:04,500
those: the SparkSession,
then app name,

6507
04:39:04,500 --> 04:39:06,389
then we have spark context.

6508
04:39:06,389 --> 04:39:07,509
So it's similar.

6509
04:39:07,509 --> 04:39:10,100
So again using
this GraphLoader class

6510
04:39:10,130 --> 04:39:11,669
and using this edgeListFile method,

6511
04:39:11,900 --> 04:39:15,700
we are loading
the followers.txt file.

6512
04:39:15,700 --> 04:39:16,733
Now on this graph

6513
04:39:16,733 --> 04:39:19,706
we are using this connected
components algorithm.

6514
04:39:19,706 --> 04:39:23,300
And then we are trying to find
the connected components now

6515
04:39:23,300 --> 04:39:26,600
at last we are trying
to again load this user file

6516
04:39:26,600 --> 04:39:28,300
that is users.txt.

6517
04:39:28,500 --> 04:39:31,312
And we are trying to join
the connected components

6518
04:39:31,312 --> 04:39:34,387
with the username so over
here it is also the same thing

6519
04:39:34,387 --> 04:39:36,504
which we have discussed
in page rank,

6520
04:39:36,504 --> 04:39:38,000
which is taking the field 0

6521
04:39:38,000 --> 04:39:41,100
and field one
of your users.txt file,

6522
04:39:41,400 --> 04:39:45,100
and at last we
are joining these users,

6523
04:39:45,100 --> 04:39:49,200
that is, we are trying to join
these users to the connected components

6524
04:39:49,200 --> 04:39:50,584
that are from here.

6525
04:39:50,584 --> 04:39:50,882
Now

6526
04:39:50,882 --> 04:39:54,008
we are printing ccByUsername
using collect.

6527
04:39:54,008 --> 04:39:58,400
So let us quickly go ahead and
execute this example as well.
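As a rough sketch of the connected components steps just described, under the same assumptions as before (relative paths under `data/graphx` are assumed, and `ConnectedComponentsSketch` is an illustrative name, not the example's object): the value attached to each username is the label of its component, which per the definition given earlier is the lowest vertex ID in that component.

```scala
import org.apache.spark.graphx.{Graph, GraphLoader, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ConnectedComponentsSketch {
  // Labels every vertex with the lowest vertex ID in its component,
  // then swaps the vertex ID for the username.
  def componentsByUsername(graph: Graph[Int, Int],
                           users: RDD[(VertexId, String)]): RDD[(String, VertexId)] = {
    val cc = graph.connectedComponents().vertices
    users.join(cc).map { case (id, (username, component)) => (username, component) }
  }

  def main(args: Array[String]): Unit = {
    // No master is set here; the sketch assumes it is launched with spark-submit.
    val spark = SparkSession.builder.appName("example2").getOrCreate()
    val sc = spark.sparkContext
    // Assumed paths: adjust to wherever followers.txt and users.txt live.
    val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
    val users = sc.textFile("data/graphx/users.txt").map { line =>
      val fields = line.split(",")
      (fields(0).toLong, fields(1)) // field 0 = vertex ID, field 1 = username
    }
    println(componentsByUsername(graph, users).collect().mkString("\n"))
    spark.stop()
  }
}
```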

6528
04:39:58,600 --> 04:40:01,400
So let me first copy
this object name.

6529
04:40:03,800 --> 04:40:17,300
Let's name this
as example2. So

6530
04:40:17,300 --> 04:40:20,100
as you can see Justin Bieber has
one connected component,

6531
04:40:20,100 --> 04:40:23,300
then you can see this has
three connected component.

6532
04:40:23,300 --> 04:40:25,100
Then this has
one connected component

6533
04:40:25,100 --> 04:40:28,600
then Barack Obama has one
connected component and so on.

6534
04:40:28,600 --> 04:40:30,464
So this basically
gives you an idea

6535
04:40:30,464 --> 04:40:32,200
about the connected components.

6536
04:40:32,200 --> 04:40:33,900
Now, let's quickly move back

6537
04:40:33,900 --> 04:40:37,300
to the slide will discuss
about the third algorithm

6538
04:40:37,300 --> 04:40:39,100
that is triangle counting.

6539
04:40:39,100 --> 04:40:43,177
So basically a Vertex is a part
of a triangle when it has

6540
04:40:43,177 --> 04:40:46,900
two adjacent vertices
with an edge between them.

6541
04:40:46,900 --> 04:40:49,100
So it will form
a triangle, right?

6542
04:40:49,100 --> 04:40:52,313
And then that vertex
is a part of a triangle

6543
04:40:52,313 --> 04:40:56,092
Now GraphX implements
a triangle counting algorithm

6544
04:40:56,092 --> 04:40:58,200
in the TriangleCount object.

6545
04:40:58,200 --> 04:41:01,200
Now that determines the number
of triangles passing

6546
04:41:01,200 --> 04:41:04,600
through each vertex providing
a measure of clustering

6547
04:41:04,600 --> 04:41:07,400
so we can compute
the triangle count

6548
04:41:07,400 --> 04:41:09,875
of the social network data set

6549
04:41:09,875 --> 04:41:13,675
from the PageRank section.
One more thing to note is

6550
04:41:13,675 --> 04:41:16,598
that TriangleCount
requires the edges

6551
04:41:16,600 --> 04:41:18,800
to be in
a canonical orientation,

6552
04:41:18,800 --> 04:41:21,364
that is, your source ID
should always be less

6553
04:41:21,364 --> 04:41:22,868
than your destination ID,

6554
04:41:22,868 --> 04:41:25,500
and the graph will be
partitioned using the

6555
04:41:25,500 --> 04:41:27,318
Graph.partitionBy method. Now,

6556
04:41:27,318 --> 04:41:28,800
let's quickly go back.

6557
04:41:28,800 --> 04:41:32,000
So let me open
the graphx directory again,

6558
04:41:32,000 --> 04:41:35,200
and we'll see
the triangle counting example.

6559
04:41:36,500 --> 04:41:38,100
So again, it's the same

6560
04:41:38,100 --> 04:41:40,900
and the object is
triangle counting example,

6561
04:41:40,900 --> 04:41:43,400
then the main function
is same as well.

6562
04:41:43,400 --> 04:41:46,400
Now we are again using
this GraphLoader class

6563
04:41:46,400 --> 04:41:50,183
and we are loading
followers.txt,

6564
04:41:50,183 --> 04:41:52,000
which contains the edges

6565
04:41:52,000 --> 04:41:53,000
as you can see here.

6566
04:41:53,000 --> 04:41:54,600
we are using this partitionBy

6567
04:41:54,600 --> 04:41:58,800
method and we are passing
RandomVertexCut,

6568
04:41:58,800 --> 04:42:01,000
which is the partition strategy.

6569
04:42:01,000 --> 04:42:03,165
So this is how you can go ahead

6570
04:42:03,165 --> 04:42:06,100
and you can Implement
a partition strategy.

6571
04:42:06,123 --> 04:42:09,277
Here we are loading the edges
in canonical order

6572
04:42:09,400 --> 04:42:11,900
and partitioning the graph
for triangle count.

6573
04:42:11,900 --> 04:42:12,129
Now.

6574
04:42:12,129 --> 04:42:14,600
We are trying to find
out the triangle count

6575
04:42:14,600 --> 04:42:15,830
for each vertex.

6576
04:42:15,830 --> 04:42:18,000
So we have this triCounts

6577
04:42:18,000 --> 04:42:22,600
variable and then we are using
this triangle count algorithm

6578
04:42:22,600 --> 04:42:25,074
and then we are
specifying the vertices

6579
04:42:25,074 --> 04:42:28,200
so it will execute
triangle count over this graph

6580
04:42:28,200 --> 04:42:31,900
which we have just loaded
from the followers.txt file.

6581
04:42:31,900 --> 04:42:35,074
And again, we are basically
joining usernames.

6582
04:42:35,074 --> 04:42:38,320
So first we are reading
the usernames. Again here

6583
04:42:38,320 --> 04:42:42,600
we are performing the join
between users and triCounts.

6584
04:42:42,900 --> 04:42:45,300
So triCounts is from here.

6585
04:42:45,300 --> 04:42:48,806
And then we are again
printing the value from here.
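A minimal sketch of the triangle counting steps described above, with the same caveats: the relative path is assumed and `TriangleCountSketch` is an illustrative name. The edge list is loaded in canonical orientation, the graph is partitioned with `RandomVertexCut`, and `triangleCount()` attaches the number of triangles to each vertex.

```scala
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy, VertexRDD}
import org.apache.spark.sql.SparkSession

object TriangleCountSketch {
  // Partitions the graph and counts the triangles passing through each vertex.
  def triangleCounts(graph: Graph[Int, Int]): VertexRDD[Int] =
    graph
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .triangleCount()
      .vertices

  def main(args: Array[String]): Unit = {
    // No master is set here; the sketch assumes it is launched with spark-submit.
    val spark = SparkSession.builder.appName("example3").getOrCreate()
    val sc = spark.sparkContext
    // Assumed path; canonicalOrientation = true makes every edge point
    // from the lower vertex ID to the higher one.
    val graph = GraphLoader.edgeListFile(
      sc, "data/graphx/followers.txt", canonicalOrientation = true)
    triangleCounts(graph).collect().foreach(println)
    spark.stop()
  }
}
```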

6586
04:42:48,806 --> 04:42:50,700
So again, this is the same.

6587
04:42:50,700 --> 04:42:52,844
Let us quickly go
ahead and execute

6588
04:42:52,844 --> 04:42:54,800
this triangle counting example.

6589
04:42:54,800 --> 04:42:56,338
So let me copy this.

6590
04:42:56,500 --> 04:42:58,300
I'll go back to the terminal.

6591
04:42:58,400 --> 04:43:02,300
I'll name it as example3
and change the class name.

6592
04:43:04,134 --> 04:43:05,365
And I hit enter.

6593
04:43:14,100 --> 04:43:16,900
So now you can see
the triangle associated

6594
04:43:16,900 --> 04:43:20,100
with Justin Bieber is 0, then
Barack Obama is one,

6595
04:43:20,100 --> 04:43:21,600
with odersky it's one,

6596
04:43:21,661 --> 04:43:23,200
and with jeresig

6597
04:43:23,200 --> 04:43:24,100
it's one.

6598
04:43:24,300 --> 04:43:27,800
So for better understanding I
would recommend you to go ahead

6599
04:43:27,800 --> 04:43:30,136
and take this followers.txt.

6600
04:43:30,136 --> 04:43:33,000
And you can create
a graph by yourself.

6601
04:43:33,000 --> 04:43:36,227
And then you can attach
these users names with them

6602
04:43:36,227 --> 04:43:38,100
and then you will get an idea

6603
04:43:38,100 --> 04:43:41,700
about why it is giving
the number as 1 or 0.

6604
04:43:41,700 --> 04:43:44,065
So again the graph
which is connecting

6605
04:43:44,065 --> 04:43:45,000
1, 2 and 4

6606
04:43:45,000 --> 04:43:47,600
is disconnected and it
is not completing any triangles.

6607
04:43:47,600 --> 04:43:52,900
So the value of these three is 0,
and next, the second graph,

6608
04:43:52,900 --> 04:43:54,600
which is connecting

6609
04:43:54,600 --> 04:43:59,400
your vertices 3, 6 and 7,
is completing one triangle.

6610
04:43:59,400 --> 04:44:01,323
So this is the reason why

6611
04:44:01,323 --> 04:44:05,300
these three vertices
have value one. Now,

6612
04:44:05,400 --> 04:44:06,952
Let me quickly go back.

6613
04:44:06,952 --> 04:44:07,875
So now I hope

6614
04:44:07,875 --> 04:44:11,000
that you guys are clear
with all the concepts

6615
04:44:11,000 --> 04:44:14,011
of graph operators
and graph algorithms.

6616
04:44:14,011 --> 04:44:17,400
So now is the right
time, and let us look

6617
04:44:17,400 --> 04:44:19,200
at a Spark GraphX demo

6618
04:44:19,300 --> 04:44:20,838
where we'll go ahead

6619
04:44:20,838 --> 04:44:24,300
and we'll try to analyze
the Ford GoBike data.

6620
04:44:24,800 --> 04:44:27,800
So let me quickly go
back to my VM.

6621
04:44:28,000 --> 04:44:29,699
So let me first show
you the website

6622
04:44:29,699 --> 04:44:32,500
where you can go ahead and
download the Ford GoBike data.

6623
04:44:38,600 --> 04:44:40,350
So over here you can go

6624
04:44:40,350 --> 04:44:43,700
to download the Ford
GoBike trip history data.

6625
04:44:46,480 --> 04:44:51,019
So you can go ahead and download
this 2017 Ford GoBike trip data.

6626
04:44:51,100 --> 04:44:53,000
So I have already downloaded it.

6627
04:44:55,300 --> 04:44:56,696
So to avoid the typos,

6628
04:44:56,696 --> 04:44:59,300
I have already written
all the commands so

6629
04:44:59,300 --> 04:45:07,100
first let me go ahead and start
the Spark shell. So I'm inside

6630
04:45:07,100 --> 04:45:09,700
the Spark shell now.

6631
04:45:09,700 --> 04:45:13,300
Let me first import graphx
and Spark RDD.

6632
04:45:15,800 --> 04:45:19,200
So I've successfully
imported graphx and Spark RDD.

6633
04:45:20,180 --> 04:45:23,719
Now, let me create
a Spark SQLContext as well.

6634
04:45:25,100 --> 04:45:28,900
So I have successfully
created the Spark SQLContext.

6635
04:45:28,900 --> 04:45:31,520
So this is basically
for running SQL queries

6636
04:45:31,520 --> 04:45:32,800
over the data frames.

6637
04:45:34,100 --> 04:45:37,176
Now, let me go ahead
and import the data.

6638
04:45:37,826 --> 04:45:40,673
So I'm loading the data
in data frame.

6639
04:45:40,800 --> 04:45:43,623
So the format of file is CSV,

6640
04:45:43,623 --> 04:45:46,853
then as an option, the header
is already added.

6641
04:45:46,853 --> 04:45:48,700
So that's why it's true.

6642
04:45:48,800 --> 04:45:51,600
Then it will automatically
infer this schema

6643
04:45:51,600 --> 04:45:53,332
and then in the load parameter,

6644
04:45:53,332 --> 04:45:55,400
I have specified
the path of the file.

6645
04:45:55,400 --> 04:45:57,100
So I'll quickly hit enter.

6646
04:45:59,100 --> 04:46:02,500
So the data is loaded
in the data frame to check.

6647
04:46:02,500 --> 04:46:07,000
I'll use df.count,
so it will give me the count.

6648
04:46:09,900 --> 04:46:16,553
So you can see it has
5 lakh 19 thousand 700 rows. Now

6649
04:46:16,553 --> 04:46:20,092
let me quickly go back
and I'll print the schema.

6650
04:46:21,400 --> 04:46:25,010
So this is the schema
the duration in second,

6651
04:46:25,010 --> 04:46:27,625
then we have
the start time end time.

6652
04:46:27,625 --> 04:46:29,876
Then you have start station ID.

6653
04:46:29,876 --> 04:46:32,200
Then you have
start station name.

6654
04:46:32,300 --> 04:46:35,761
Then you have start
station latitude longitude

6655
04:46:35,761 --> 04:46:37,207
then end station ID

6656
04:46:37,207 --> 04:46:40,360
end station name, then
end station latitude,

6657
04:46:40,360 --> 04:46:42,007
end station longitude.

6658
04:46:42,007 --> 04:46:46,500
Then your bike ID user type then
the birth year of the member

6659
04:46:46,500 --> 04:46:48,650
and the gender
of the member now,

6660
04:46:48,650 --> 04:46:50,800
I'm trying to create
a data frame

6661
04:46:50,800 --> 04:46:52,306
that is justStations,

6662
04:46:52,306 --> 04:46:56,300
so it will only contain
the station ID and station name

6663
04:46:56,300 --> 04:46:58,607
which I'll be using as vertex.

6664
04:46:58,800 --> 04:47:02,000
So here I am trying
to create a data frame

6665
04:47:02,000 --> 04:47:03,500
with the name justStations,

6666
04:47:03,658 --> 04:47:07,120
where I am just selecting
the start station ID

6667
04:47:07,120 --> 04:47:09,600
and I'm casting it as float

6668
04:47:09,600 --> 04:47:12,400
and then I'm selecting
the start station name

6669
04:47:12,400 --> 04:47:15,400
and then I'm using
the distinct function to only

6670
04:47:15,400 --> 04:47:17,169
keep the unique values.

6671
04:47:17,169 --> 04:47:19,864
So I quickly go
ahead and hit enter.

6672
04:47:20,100 --> 04:47:21,600
So again, let me go

6673
04:47:21,600 --> 04:47:27,000
ahead and use this justStations
and I will print the schema.

6674
04:47:28,300 --> 04:47:31,531
So you can see
there is station ID,

6675
04:47:31,531 --> 04:47:34,000
and then there is
start station name.

6676
04:47:34,569 --> 04:47:36,800
It contains the unique values

6677
04:47:36,800 --> 04:47:40,600
of stations in this justStations
data frame.

6678
04:47:40,800 --> 04:47:41,735
So now again,

6679
04:47:41,735 --> 04:47:44,900
I am creating this stations,
where I'm selecting

6680
04:47:44,900 --> 04:47:47,971
the start station ID
and end station ID.

6681
04:47:47,971 --> 04:47:49,900
Then I am using rdd.distinct,

6682
04:47:49,900 --> 04:47:52,700
which will again give
me the unique values

6683
04:47:52,700 --> 04:47:54,600
and I'm using this flatMap

6684
04:47:54,600 --> 04:47:56,200
where I am specifying

6685
04:47:56,200 --> 04:47:59,700
the Iterable where we
are taking x(0),

6686
04:47:59,700 --> 04:48:01,700
that is your start station ID,

6687
04:48:01,700 --> 04:48:04,405
and I am taking x(1),
which is your end

6688
04:48:04,405 --> 04:48:05,700
station ID, and then again,

6689
04:48:05,700 --> 04:48:07,800
I'm applying this
distinct function

6690
04:48:07,800 --> 04:48:12,200
that it will keep only
the unique values and then

6691
04:48:12,400 --> 04:48:14,600
at last we have the toDF function,

6692
04:48:14,600 --> 04:48:16,619
which will convert
it to data frame.

6693
04:48:16,619 --> 04:48:19,100
So let me quickly go ahead
and execute this.

6694
04:48:19,500 --> 04:48:21,376
So I am printing this schema.

6695
04:48:21,376 --> 04:48:23,576
So as you can see
it has one column

6696
04:48:23,576 --> 04:48:26,100
that is value and it
has data type long.

6697
04:48:26,100 --> 04:48:29,715
So I have taken all
the start and end station ID

6698
04:48:29,715 --> 04:48:31,561
and using this flatMap

6699
04:48:31,561 --> 04:48:34,200
I have iterated
over all the start

6700
04:48:34,200 --> 04:48:37,705
and end station IDs, and then
using the distinct function

6701
04:48:37,705 --> 04:48:41,600
and taking the unique values
and converting it to data frames

6702
04:48:41,600 --> 04:48:44,800
so I can use the stations
and using the stations

6703
04:48:44,800 --> 04:48:49,000
I will basically keep each
of the stations in a Vertex.

6704
04:48:49,000 --> 04:48:52,500
So this is the reason why
I'm taking the stations

6705
04:48:52,500 --> 04:48:55,300
or you can say I am taking
the unique stations

6706
04:48:55,300 --> 04:48:58,107
from the start station ID
and end station ID

6707
04:48:58,107 --> 04:48:59,691
so that I can go ahead

6708
04:48:59,691 --> 04:49:02,500
and I can define
vertex as the stations.

6709
04:49:03,100 --> 04:49:06,400
So now we are creating
our set of vertices

6710
04:49:06,400 --> 04:49:09,804
and attaching a bit
of metadata to each one of them

6711
04:49:09,804 --> 04:49:12,800
which in our case is
the name of the station.

6712
04:49:12,800 --> 04:49:16,035
So as you can see we are
creating this stationVertices,

6713
04:49:16,035 --> 04:49:18,679
which is again an RDD
of vertex ID and String.

6714
04:49:18,679 --> 04:49:21,700
So we are using the stations
which we have just created.

6715
04:49:21,700 --> 04:49:24,500
We are joining it
with justStations,

6716
04:49:24,500 --> 04:49:27,100
where the stations value
should be equal

6717
04:49:27,100 --> 04:49:29,300
to the justStations station ID.

6718
04:49:29,600 --> 04:49:32,400
So as we have created stations,

6719
04:49:32,400 --> 04:49:35,200
and justStations,
we are joining them,

6720
04:49:36,600 --> 04:49:39,061
And then selecting
the station ID

6721
04:49:39,061 --> 04:49:43,000
and start station name
then we are mapping row 0.

6722
04:49:44,700 --> 04:49:48,600
And Row 1 so your row
0 will basically be

6723
04:49:48,600 --> 04:49:51,088
your vertex ID and Row
1 will be the string.

6724
04:49:51,088 --> 04:49:55,100
that is, the name of your station.
So let me quickly go ahead

6725
04:49:55,100 --> 04:49:56,300
and execute this.

6726
04:49:57,357 --> 04:50:01,742
So let us quickly print this
using collect, foreach and println.

6727
04:50:19,500 --> 04:50:20,366
So over here,

6728
04:50:20,366 --> 04:50:23,900
we are basically attaching
the edges, or you can say we

6729
04:50:23,900 --> 04:50:27,500
are creating the trip edges
for all our individual rides,

6730
04:50:27,500 --> 04:50:29,900
and then we'll get
the station values

6731
04:50:30,350 --> 04:50:33,350
and then we'll add
a dummy value of one.

6732
04:50:33,800 --> 04:50:34,900
So as you can see

6733
04:50:34,900 --> 04:50:37,200
that I am selecting
the start station and

6734
04:50:37,200 --> 04:50:38,600
end station from the df,

6735
04:50:38,600 --> 04:50:41,300
which is the first data frame
which we have loaded

6736
04:50:41,300 --> 04:50:46,200
and then I am mapping
it to row(0) and row(1),

6737
04:50:46,400 --> 04:50:49,000
which is your source
and destination.

6738
04:50:49,100 --> 04:50:53,500
And then I'm attaching
a value one to each one of them.

6739
04:50:53,600 --> 04:50:55,000
So I'll hit enter.

6740
04:50:57,500 --> 04:51:00,900
Now, let me quickly go ahead
and print this stationEdges.

6741
04:51:07,500 --> 04:51:10,300
So just taking the source
ID of the vertex

6742
04:51:10,300 --> 04:51:12,182
and destination ID of the vertex

6743
04:51:12,182 --> 04:51:14,800
or you can say source station ID
and destination station ID,

6744
04:51:14,800 --> 04:51:17,900
and it is attaching value
one to each one of them.

6745
04:51:17,900 --> 04:51:20,700
So now you can go ahead
and build your graph.

6746
04:51:20,700 --> 04:51:23,854
But again, as we discussed,
that we need a default station

6747
04:51:23,854 --> 04:51:25,700
so you can have some situations

6748
04:51:25,700 --> 04:51:29,033
where your edges might be
indicating some vertices,

6749
04:51:29,033 --> 04:51:31,500
but those vertices
might not be present

6750
04:51:31,500 --> 04:51:33,107
in your vertex RDD.

6751
04:51:33,107 --> 04:51:34,764
So for that situation,

6752
04:51:34,764 --> 04:51:37,400
we need to create
a default station.

6753
04:51:37,400 --> 04:51:40,651
So I created a default station
as missing station.

6754
04:51:40,651 --> 04:51:42,100
So now we are all set.

6755
04:51:42,100 --> 04:51:44,400
We can go ahead
and create the graph.

6756
04:51:44,400 --> 04:51:46,700
So the name of the graph
is stationGraph.

6757
04:51:46,700 --> 04:51:49,000
Then the vertices
are stationVertices,

6758
04:51:49,000 --> 04:51:50,485
which we have created

6759
04:51:50,485 --> 04:51:54,247
which basically contains
the station ID and station name

6760
04:51:54,247 --> 04:51:56,300
and then we have stationEdges,

6761
04:51:56,300 --> 04:51:58,600
and at last we
have default station.
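The graph construction described above can be sketched as below. This is a simplified version under stated assumptions, not the exact commands typed in the shell: the column names `start_station_id`, `start_station_name` and `end_station_id` are assumed from the schema read out earlier, the vertex RDD is built directly from the distinct start stations (skipping the separate stations and justStations join), and `StationGraphSketch` and the CSV path argument are illustrative.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object StationGraphSketch {
  def buildStationGraph(df: DataFrame): Graph[String, Long] = {
    // Vertices: (station ID, station name), one per distinct start station.
    val stationVertices: RDD[(VertexId, String)] = df
      .select("start_station_id", "start_station_name")
      .distinct()
      .rdd
      .map(row => (row.getAs[Number](0).longValue(), row.getString(1)))
    // Edges: one per trip, source -> destination, with a dummy value of 1.
    val stationEdges: RDD[Edge[Long]] = df
      .select("start_station_id", "end_station_id")
      .rdd
      .map(row => Edge(row.getAs[Number](0).longValue(),
                       row.getAs[Number](1).longValue(), 1L))
    // Any station that appears only in an edge gets the default attribute.
    Graph(stationVertices, stationEdges, "Missing Station")
  }

  def main(args: Array[String]): Unit = {
    // No master is set here; the sketch assumes it is launched with spark-submit.
    val spark = SparkSession.builder.appName("stations").getOrCreate()
    // args(0): path to the trip data CSV, supplied by the caller.
    val df = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(args(0))
    val stationGraph = buildStationGraph(df).cache()
    println(s"vertices: ${stationGraph.numVertices}, edges: ${stationGraph.numEdges}")
    println(s"rows in data frame: ${df.count()}") // sanity check against numEdges
    spark.stop()
  }
}
```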

6762
04:51:58,600 --> 04:52:01,500
So let me quickly go ahead
and execute this.

6763
04:52:03,100 --> 04:52:06,500
So now I need to cache this graph
for faster access.

6764
04:52:06,500 --> 04:52:08,700
So I'll use the cache function.

6765
04:52:09,500 --> 04:52:13,300
So let us quickly go ahead and
check the number of vertices.

6766
04:52:24,700 --> 04:52:28,600
So these are the number
of vertices again,

6767
04:52:28,900 --> 04:52:31,600
we can check the number
of edges as well.

6768
04:52:35,700 --> 04:52:37,300
So these are
the number of edges.

6769
04:52:38,405 --> 04:52:40,400
And to get a sanity check.

6770
04:52:40,400 --> 04:52:43,500
So let's go ahead
and check the number of records

6771
04:52:43,500 --> 04:52:45,500
that are present
in the data frame.

6772
04:52:48,000 --> 04:52:50,900
So as you can see
that the number of edges

6773
04:52:50,900 --> 04:52:55,100
in our graph and the count
in our data frame are similar,

6774
04:52:55,100 --> 04:52:56,900
or you can say the same.

6775
04:52:56,900 --> 04:53:00,702
So now let's go ahead and run
PageRank on our data,

6776
04:53:00,702 --> 04:53:04,200
so we can either run
a set number of iterations

6777
04:53:04,200 --> 04:53:06,700
or we can run it
until convergence.

6778
04:53:06,700 --> 04:53:10,400
So in my case,
I'll run it till convergence.

6779
04:53:11,700 --> 04:53:15,000
So it's ranks, then
station graph, then PageRank.

6780
04:53:15,000 --> 04:53:17,133
So I have specified
the double value,

6781
04:53:17,133 --> 04:53:21,000
so it will run till convergence.
So let's wait for some time.
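
Running until convergence can be sketched in plain Python as below. This is a textbook-style PageRank with an assumed damping factor of 0.85 and made-up station names, not GraphX's own implementation. The double value plays the role of the tolerance: iteration stops once no rank changes by more than it.

```python
def pagerank(edges, tol=1e-4, damping=0.85, max_iter=1000):
    """Iterate PageRank until the largest per-vertex change drops below tol."""
    nodes = sorted({n for edge in edges for n in edge})
    out_deg = {n: 0 for n in nodes}
    for src, _dst in edges:
        out_deg[src] += 1
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Rank held by vertices with no outgoing edges is spread evenly.
        dangling = sum(ranks[n] for n in nodes if out_deg[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_ranks = {n: base for n in nodes}
        for src, dst in edges:
            new_ranks[dst] += damping * ranks[src] / out_deg[src]
        delta = max(abs(new_ranks[n] - ranks[n]) for n in nodes)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


# Hypothetical trips between three stations: (start, end).
trips = [("A", "C"), ("B", "C"), ("C", "A"), ("A", "B")]
ranks = pagerank(trips)
top = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

A smaller tolerance means more iterations before stopping, which is why the run can take a while on a real data set.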

6782
04:53:51,600 --> 04:53:55,400
So now we have executed
the PageRank algorithm.

6783
04:53:55,700 --> 04:53:57,300
So we got the ranks

6784
04:53:57,300 --> 04:53:59,700
which are attached
to each vertex.

6785
04:54:00,100 --> 04:54:03,700
So now let us quickly go ahead
and look at the ranks.

6786
04:54:03,700 --> 04:54:06,601
So we are joining ranks
with station vertices

6787
04:54:06,601 --> 04:54:09,675
and then we are sorting it
in descending order,

6788
04:54:09,675 --> 04:54:11,900
and we are taking
the first 10 rows

6789
04:54:11,900 --> 04:54:13,500
and then we are printing them.

6790
04:54:13,500 --> 04:54:16,700
So let's quickly go
ahead and hit enter.

6791
04:54:21,700 --> 04:54:26,000
So you can see these are
the top 10 stations which have

6792
04:54:26,000 --> 04:54:27,800
the highest PageRank values,

6793
04:54:27,800 --> 04:54:30,800
so you can say they have
more incoming trips.

6794
04:54:30,800 --> 04:54:32,270
Now one question would be

6795
04:54:32,270 --> 04:54:35,000
what are the most common
destinations in the data set

6796
04:54:35,000 --> 04:54:36,598
from location to location?

6797
04:54:36,598 --> 04:54:40,500
so we can do this by performing
a grouping operator and adding

6798
04:54:40,500 --> 04:54:42,218
the edge counts together.

6799
04:54:42,218 --> 04:54:46,000
So basically this will give
a new graph, except each edge

6800
04:54:46,000 --> 04:54:50,300
will now be the sum of all
the semantically same edges.

6801
04:54:51,500 --> 04:54:53,700
So again, we are taking
the station graph.

6802
04:54:53,700 --> 04:54:56,800
We are performing groupEdges
on edge1 and edge2.

6803
04:54:56,800 --> 04:55:00,197
So we are basically
grouping edge1 and edge2.

6804
04:55:00,200 --> 04:55:01,629
So we are aggregating them.

6805
04:55:01,629 --> 04:55:03,100
Then we are using triplets,

6806
04:55:03,100 --> 04:55:06,099
and then we are sorting them
in descending order again.

6807
04:55:06,099 --> 04:55:08,200
And then we are
printing the triplets

6808
04:55:08,200 --> 04:55:10,908
from the source vertex
and the number of trips,

6809
04:55:10,908 --> 04:55:13,864
and then we are taking
the destination attribute

6810
04:55:13,864 --> 04:55:15,500
or you can say destination

6811
04:55:15,500 --> 04:55:18,100
vertex, or you can say
destination station.
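
The grouping step can be pictured with a small plain-Python sketch. It is not the GraphX call itself, and the station names and trips are invented. Parallel edges between the same pair of stations are merged by adding their counts, then sorted in descending order.

```python
from collections import Counter

# Hypothetical trips, one (start, end) pair per trip.
trips = [
    ("A", "B"), ("A", "B"), ("A", "B"),
    ("B", "A"), ("B", "A"),
    ("C", "A"),
]

# Each trip is an edge with count 1; edges with the same (start, end)
# are merged by adding their counts together.
trip_counts = Counter()
for src, dst in trips:
    trip_counts[(src, dst)] += 1

# Sort the merged edges by trip count, highest first, and keep the top 2.
top_routes = trip_counts.most_common(2)
for (src, dst), count in top_routes:
    print(f"There were {count} trips from {src} to {dst}.")
```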

6812
04:55:26,526 --> 04:55:28,373
So you can see there are

6813
04:55:28,500 --> 04:55:32,300
1933 trips from San
Francisco Ferry Building

6814
04:55:32,300 --> 04:55:34,100
to this station. Then again,

6815
04:55:34,100 --> 04:55:36,700
you can see there are
fourteen hundred and eleven

6816
04:55:36,700 --> 04:55:39,900
trips from San Francisco
to this location.

6817
04:55:39,900 --> 04:55:42,200
Then there are 1,025 trips

6818
04:55:42,200 --> 04:55:45,300
from this station
to San Francisco

6819
04:55:45,500 --> 04:55:49,100
and so on. So now we
have got a directed graph,

6820
04:55:49,100 --> 04:55:50,885
that means our
trips are directional

6821
04:55:50,885 --> 04:55:52,400
from one location to another

6822
04:55:52,600 --> 04:55:55,787
So now we can go ahead
and find the number of trips

6823
04:55:55,787 --> 04:55:57,725
that went to a specific station

6824
04:55:57,725 --> 04:56:00,100
and that left
from a specific station.

6825
04:56:00,100 --> 04:56:01,806
So basically we are trying

6826
04:56:01,806 --> 04:56:04,300
to find the inbound
and outbound values

6827
04:56:04,300 --> 04:56:07,829
or you can say we are trying
to find in degree and out degree

6828
04:56:07,829 --> 04:56:08,723
of the stations.

6829
04:56:08,723 --> 04:56:12,300
So let us first calculate the
in-degrees using station graph,

6830
04:56:12,300 --> 04:56:14,364
and I am using
the inDegrees operator.

6831
04:56:14,364 --> 04:56:17,298
Then I'm joining it
with the station vertices

6832
04:56:17,298 --> 04:56:20,435
and then I'm sorting it again
in descending order

6833
04:56:20,435 --> 04:56:22,852
and then I'm taking
the top 10 values.

6834
04:56:22,852 --> 04:56:25,400
So let's quickly go
ahead and hit enter.

6835
04:56:30,900 --> 04:56:34,815
So these are the top 10 stations,
and you can see the in-degrees.

6836
04:56:34,815 --> 04:56:36,600
So there are these many trips

6837
04:56:36,600 --> 04:56:38,797
which are coming
into these stations.

6838
04:56:38,797 --> 04:56:39,651
Now, similarly,

6839
04:56:39,651 --> 04:56:41,300
we can find the out-degree.

6840
04:56:48,200 --> 04:56:51,400
Now again, you can see
the out degrees as well.

6841
04:56:51,400 --> 04:56:54,896
So these are the stations
and these are the out degrees.
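
As a quick illustration of in-degree and out-degree, here is a plain-Python sketch with invented trips. It is not the GraphX operators. The in-degree of a station counts trips ending there, and the out-degree counts trips starting there.

```python
from collections import Counter

# Hypothetical trips as (start, end) pairs.
trips = [("A", "B"), ("A", "B"), ("C", "B"), ("B", "A")]

in_degrees = Counter(dst for _src, dst in trips)   # trips arriving at a station
out_degrees = Counter(src for src, _dst in trips)  # trips leaving a station

# Sort in descending order and take the top 10, as in the walkthrough.
top_in = in_degrees.most_common(10)
top_out = out_degrees.most_common(10)
print(top_in)
print(top_out)
```

Comparing the two counters per station is one way to spot stations that mostly receive trips or mostly send them.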

6842
04:56:54,896 --> 04:56:58,439
So again, you can go ahead
and perform some more operations

6843
04:56:58,439 --> 04:56:59,400
over this graph.

6844
04:56:59,400 --> 04:57:01,635
So you can go ahead
and find the station

6845
04:57:01,635 --> 04:57:03,700
which has the most
incoming trips,

6846
04:57:03,700 --> 04:57:07,241
that is, the most people
coming into that station,

6847
04:57:07,241 --> 04:57:09,758
but fewer people are
leaving that station

6848
04:57:09,758 --> 04:57:13,320
and again on the contrary
you can find out the stations

6849
04:57:13,320 --> 04:57:15,538
where there are
more edges,

6850
04:57:15,538 --> 04:57:18,240
or you can say trips,
leaving those stations,

6851
04:57:18,240 --> 04:57:19,848
but there are fewer

6852
04:57:19,848 --> 04:57:22,100
trips coming
into those stations.

6853
04:57:22,100 --> 04:57:25,800
So I guess you guys are
now clear with Spark GraphX.

6854
04:57:25,800 --> 04:57:27,810
We discussed
the different types

6855
04:57:27,810 --> 04:57:29,398
of graphs. Then, moving ahead,

6856
04:57:29,398 --> 04:57:31,100
we discussed the
features of GraphX.

6857
04:57:31,100 --> 04:57:33,675
Then we discussed something
about the property graph.

6858
04:57:33,675 --> 04:57:35,500
We understood what
a property graph is,

6859
04:57:35,500 --> 04:57:38,200
how you can create vertex
how you can create edges

6860
04:57:38,200 --> 04:57:40,800
and how to use VertexRDD and EdgeRDD.

6861
04:57:40,800 --> 04:57:44,500
Then we looked at some of
the important vertex operations

6862
04:57:44,500 --> 04:57:48,500
and at last we understood some
of the graph algorithms.

6863
04:57:48,500 --> 04:57:51,349
So I guess now you
guys are clear about

6864
04:57:51,349 --> 04:57:53,600
how to work with Spark GraphX.

6865
04:57:58,300 --> 04:58:01,300
Today's video is
on Hadoop versus Spark.

6866
04:58:01,400 --> 04:58:04,683
Now as we know organizations
from different domains

6867
04:58:04,683 --> 04:58:07,400
are investing in big
data analytics today.

6868
04:58:07,400 --> 04:58:10,400
They're analyzing large
data sets to uncover

6869
04:58:10,400 --> 04:58:11,730
hidden patterns,

6870
04:58:11,730 --> 04:58:15,510
unknown correlations, market
trends, customer preferences,

6871
04:58:15,510 --> 04:58:18,100
and other useful
business information.

6872
04:58:18,100 --> 04:58:20,800
Analytical findings
are helping organizations

6873
04:58:20,800 --> 04:58:24,100
with more effective marketing,
new revenue opportunities,

6874
04:58:24,100 --> 04:58:25,973
and better customer service

6875
04:58:25,973 --> 04:58:29,241
and they're trying
to get competitive advantages

6876
04:58:29,241 --> 04:58:30,947
over rival organizations

6877
04:58:30,947 --> 04:58:33,920
and other business benefits.
Apache Spark

6878
04:58:33,920 --> 04:58:38,000
and Hadoop are two of the most
prominent big data frameworks,

6879
04:58:38,000 --> 04:58:41,289
and I see people often comparing
these two technologies

6880
04:58:41,289 --> 04:58:44,700
and that is what exactly
we're going to do in this video.

6881
04:58:44,700 --> 04:58:48,100
Now, we'll compare these two big
data frameworks based

6882
04:58:48,100 --> 04:58:49,800
on different parameters,

6883
04:58:49,800 --> 04:58:52,487
but first it is important
to get an overview

6884
04:58:52,487 --> 04:58:53,800
of what Hadoop is

6885
04:58:53,800 --> 04:58:55,600
and what Apache Spark is.

6886
04:58:55,600 --> 04:58:58,900
So let me just tell you a little
bit about Hadoop. Hadoop is

6887
04:58:58,900 --> 04:59:00,200
a framework to store

6888
04:59:00,200 --> 04:59:04,200
and process large sets of data
across computer clusters

6889
04:59:04,200 --> 04:59:07,100
and Hadoop can scale
from a single computer system

6890
04:59:07,100 --> 04:59:09,710
up to thousands
of commodity systems

6891
04:59:09,710 --> 04:59:11,500
that offer local storage

6892
04:59:11,500 --> 04:59:14,801
and compute power. Hadoop
is composed of modules

6893
04:59:14,801 --> 04:59:18,500
that work together to create
the entire Hadoop framework.

6894
04:59:18,500 --> 04:59:20,557
These are some of the components

6895
04:59:20,557 --> 04:59:23,254
that we have in the
entire Hadoop framework

6896
04:59:23,254 --> 04:59:24,800
or the Hadoop ecosystem.

6897
04:59:24,800 --> 04:59:27,500
For example, let
me tell you about HDFS,

6898
04:59:27,500 --> 04:59:30,856
which is the storage unit
of Hadoop, and YARN, which is

6899
04:59:30,856 --> 04:59:32,500
for resource management.

6900
04:59:32,500 --> 04:59:34,600
There are different
analytical tools

6901
04:59:34,600 --> 04:59:39,500
like Apache Hive and Pig, and NoSQL
databases like Apache HBase.

6902
04:59:39,900 --> 04:59:40,900
Even Apache Spark

6903
04:59:40,900 --> 04:59:43,893
and Apache Storm fit
in the Hadoop ecosystem

6904
04:59:43,893 --> 04:59:45,399
for processing big data

6905
04:59:45,399 --> 04:59:49,200
in real time. For ingesting data,
we have tools like Flume

6906
04:59:49,200 --> 04:59:52,082
and Sqoop. Flume is used
to ingest unstructured data

6907
04:59:52,082 --> 04:59:53,600
or semi-structured data

6908
04:59:53,600 --> 04:59:57,135
whereas Sqoop is used to ingest
structured data into HDFS.

6909
04:59:57,135 --> 04:59:59,900
If you want to learn more
about these tools,

6910
04:59:59,900 --> 05:00:01,470
you can go to Edureka's

6911
05:00:01,470 --> 05:00:04,000
YouTube channel and look
for Hadoop tutorial

6912
05:00:04,000 --> 05:00:06,600
where everything has
been explained in detail.

6913
05:00:06,600 --> 05:00:08,171
Now, let's move to Spark.

6914
05:00:08,171 --> 05:00:12,100
Apache Spark is a lightning-fast
cluster computing technology

6915
05:00:12,100 --> 05:00:14,400
that is designed
for fast computation.

6916
05:00:14,400 --> 05:00:18,223
The main feature of Spark
is its in-memory

6917
05:00:18,223 --> 05:00:19,400
cluster computing,

6918
05:00:19,400 --> 05:00:23,482
that increases the processing
speed of an application. Spark can

6919
05:00:23,482 --> 05:00:27,100
perform similar operations
to that of Hadoop modules,

6920
05:00:27,100 --> 05:00:30,365
but it uses in-memory
processing and optimizes

6921
05:00:30,365 --> 05:00:33,791
the steps. The primary
difference between MapReduce

6922
05:00:33,791 --> 05:00:35,400
in Hadoop and Spark is

6923
05:00:35,400 --> 05:00:38,500
that MapReduce uses
persistent storage

6924
05:00:38,500 --> 05:00:42,100
and Spark uses resilient
distributed data sets,

6925
05:00:42,100 --> 05:00:44,920
which are known as
RDDs, which reside

6926
05:00:44,920 --> 05:00:48,458
in memory. The different
components in Spark are:

6927
05:00:48,800 --> 05:00:52,000
the Spark Core engine. Spark
Core is the base engine

6928
05:00:52,000 --> 05:00:53,600
for large-scale parallel

6929
05:00:53,600 --> 05:00:57,463
and distributed data processing.
Further, additional libraries,

6930
05:00:57,463 --> 05:01:01,100
which are built on top of
the core, allow diverse workloads

6931
05:01:01,100 --> 05:01:02,381
for streaming, SQL,

6932
05:01:02,381 --> 05:01:06,000
and machine learning. Spark
Core is also responsible

6933
05:01:06,000 --> 05:01:09,500
for memory management
and fault recovery, scheduling,

6934
05:01:09,500 --> 05:01:12,749
distributing, and monitoring
jobs on a cluster,

6935
05:01:12,749 --> 05:01:16,000
and interacting with
the storage systems as well.

6936
05:01:16,100 --> 05:01:16,649
Next up.

6937
05:01:16,649 --> 05:01:18,300
We have Spark Streaming.

6938
05:01:18,300 --> 05:01:20,906
Spark Streaming is
the component of Spark

6939
05:01:20,906 --> 05:01:24,100
which is used to process
real-time streaming data.

6940
05:01:24,100 --> 05:01:25,822
It enables high throughput

6941
05:01:25,822 --> 05:01:29,600
and fault-tolerant stream
processing of live data streams.

6942
05:01:29,600 --> 05:01:33,500
We have Spark SQL. Spark
SQL is a new module in Spark

6943
05:01:33,500 --> 05:01:36,800
which integrates relational
processing with Spark's

6944
05:01:36,800 --> 05:01:38,800
functional programming API.

6945
05:01:38,800 --> 05:01:41,700
It supports querying
data either via SQL

6946
05:01:41,700 --> 05:01:44,000
or via the Hive Query Language.

6947
05:01:44,000 --> 05:01:46,381
For those of you
familiar with RDBMS,

6948
05:01:46,381 --> 05:01:48,300
Spark SQL will be an easy

6949
05:01:48,300 --> 05:01:51,637
transition from your earlier
tools, where you can extend

6950
05:01:51,637 --> 05:01:55,100
the boundaries of traditional
relational data processing.

6951
05:01:55,200 --> 05:02:00,092
Next up is GraphX. GraphX is
the Spark API for graphs

6952
05:02:00,092 --> 05:02:02,400
and graph parallel computation

6953
05:02:02,400 --> 05:02:04,867
and thus it extends
the spark resilient

6954
05:02:04,867 --> 05:02:08,700
distributed data sets with a
resilient distributed property

6955
05:02:08,700 --> 05:02:09,500
graph.

6956
05:02:09,900 --> 05:02:13,000
Next is Spark MLlib
for machine learning.

6957
05:02:13,000 --> 05:02:16,500
MLlib stands for machine
learning library. Spark

6958
05:02:16,500 --> 05:02:18,300
MLlib is used
to perform machine

6959
05:02:18,400 --> 05:02:20,900
learning in Apache Spark. Now,

6960
05:02:20,900 --> 05:02:24,200
since you've got an overview
of both these frameworks,

6961
05:02:24,200 --> 05:02:25,985
I believe that the ground

6962
05:02:25,985 --> 05:02:29,200
is all set to compare
Apache Spark and Hadoop.

6963
05:02:29,200 --> 05:02:32,617
Let's move ahead and compare
Apache spark with Hadoop

6964
05:02:32,617 --> 05:02:36,100
on different parameters
to understand their strengths.

6965
05:02:36,100 --> 05:02:38,887
We will be comparing
these two Frameworks

6966
05:02:38,887 --> 05:02:40,700
based on these parameters.

6967
05:02:40,700 --> 05:02:44,400
Let's start with performance.
First, Spark is fast

6968
05:02:44,400 --> 05:02:45,476
because it has

6969
05:02:45,476 --> 05:02:49,000
in-memory processing. It
can also use disk for data

6970
05:02:49,000 --> 05:02:51,774
that doesn't fit
into memory. Spark's

6971
05:02:51,774 --> 05:02:55,851
in-memory processing delivers
near real-time analytics

6972
05:02:56,000 --> 05:02:57,771
and this makes Spark suitable

6973
05:02:57,771 --> 05:03:00,300
for credit card
processing systems, machine

6974
05:03:00,300 --> 05:03:02,300
learning, security analysis,

6975
05:03:02,300 --> 05:03:05,100
and processing data
for IoT sensors.

6976
05:03:05,200 --> 05:03:07,700
Now, let's talk
about Hadoop's performance.

6977
05:03:07,700 --> 05:03:10,700
Now, Hadoop was originally
designed to continuously

6978
05:03:10,700 --> 05:03:13,700
gather data from multiple
sources without worrying

6979
05:03:13,700 --> 05:03:14,800
about the type of data

6980
05:03:14,800 --> 05:03:15,687
and storing it

6981
05:03:15,687 --> 05:03:18,544
across a distributed
environment, and MapReduce

6982
05:03:18,544 --> 05:03:22,185
uses batch processing.
MapReduce was never built for

6983
05:03:22,185 --> 05:03:24,108
real-time processing. The main idea

6984
05:03:24,108 --> 05:03:27,751
behind YARN is parallel
processing over a distributed data

6985
05:03:27,751 --> 05:03:30,400
set. The problem
with comparing the two is

6986
05:03:30,400 --> 05:03:33,400
that they have different
ways of processing,

6987
05:03:33,400 --> 05:03:37,400
and the ideas behind their
development are also divergent.

6988
05:03:37,700 --> 05:03:40,300
Next, ease of use. Spark comes

6989
05:03:40,300 --> 05:03:44,400
with user-friendly APIs
for Scala, Java, Python,

6990
05:03:44,400 --> 05:03:48,300
and Spark SQL. Spark SQL
is very similar to SQL,

6991
05:03:48,600 --> 05:03:50,047
so it becomes easier

6992
05:03:50,047 --> 05:03:53,202
for SQL developers
to learn it. Spark also

6993
05:03:53,202 --> 05:03:55,272
provides an interactive shell

6994
05:03:55,272 --> 05:03:58,700
for developers to query
and perform other actions

6995
05:03:58,700 --> 05:04:00,800
and have immediate feedback.

6996
05:04:00,900 --> 05:04:02,762
Now, let's talk about Hadoop.

6997
05:04:02,762 --> 05:04:06,544
You can ingest data in Hadoop
easily, either by using the shell

6998
05:04:06,544 --> 05:04:09,000
or integrating it
with multiple tools,

6999
05:04:09,000 --> 05:04:10,353
like Sqoop and Flume,

7000
05:04:10,353 --> 05:04:13,021
and YARN is just
a processing framework

7001
05:04:13,021 --> 05:04:15,900
that can be integrated
with multiple tools

7002
05:04:15,900 --> 05:04:18,200
like Hive and Pig for analytics.

7003
05:04:18,200 --> 05:04:20,353
Hive is a data
warehousing component

7004
05:04:20,353 --> 05:04:22,381
which performs reading, writing,

7005
05:04:22,381 --> 05:04:26,058
and managing large data sets
in a distributed environment

7006
05:04:26,058 --> 05:04:29,100
using a SQL-like interface.
To conclude here,

7007
05:04:29,100 --> 05:04:31,700
both of them have
their own ways to make

7008
05:04:31,700 --> 05:04:33,500
themselves user-friendly.

7009
05:04:33,826 --> 05:04:36,365
Now, let's come
to the cost. Hadoop

7010
05:04:36,365 --> 05:04:39,903
and Spark are both Apache
open source projects.

7011
05:04:40,000 --> 05:04:43,900
So there's no cost for the
software. Cost is only associated

7012
05:04:43,900 --> 05:04:47,433
with the infrastructure. Both
the products are designed

7013
05:04:47,433 --> 05:04:48,300
in such a way

7014
05:04:48,300 --> 05:04:50,800
that they can run
on commodity hardware

7015
05:04:50,800 --> 05:04:54,100
with low TCO or total
cost of ownership.

7016
05:04:54,800 --> 05:04:56,895
Well, now you might
be wondering about the ways

7017
05:04:56,895 --> 05:04:58,400
in which they are different.

7018
05:04:58,400 --> 05:05:02,117
They're not all the same. Storage
and processing in Hadoop is

7019
05:05:02,117 --> 05:05:05,700
disk-based, and Hadoop uses
standard amounts of memory.

7020
05:05:05,700 --> 05:05:06,717
So with Hadoop,

7021
05:05:06,717 --> 05:05:07,600
we need a lot

7022
05:05:07,600 --> 05:05:12,200
of disk space as well as
faster transfer speeds. Hadoop

7023
05:05:12,200 --> 05:05:15,300
also requires multiple
systems to distribute

7024
05:05:15,300 --> 05:05:17,000
the disk input output,

7025
05:05:17,000 --> 05:05:18,900
but in the case of Apache Spark,

7026
05:05:18,900 --> 05:05:22,800
due to its in-memory processing,
it requires a lot of memory,

7027
05:05:22,800 --> 05:05:24,900
but it can deal
with a standard

7028
05:05:24,900 --> 05:05:28,400
speed and amount of disk, as
disk space is a relatively

7029
05:05:28,400 --> 05:05:29,855
inexpensive commodity

7030
05:05:29,855 --> 05:05:32,985
and since Spark does not use
disk input/output

7031
05:05:32,985 --> 05:05:34,591
for processing, instead

7032
05:05:34,591 --> 05:05:36,337
it requires large amounts

7033
05:05:36,337 --> 05:05:39,200
of RAM for executing
everything in memory.

7034
05:05:39,300 --> 05:05:42,000
So Spark systems
incur more cost.

7035
05:05:42,300 --> 05:05:45,314
But yes, one important thing
to keep in mind is

7036
05:05:45,314 --> 05:05:49,400
that Spark's technology reduces
the number of required systems.

7037
05:05:49,400 --> 05:05:52,900
It needs significantly
fewer systems that cost more,

7038
05:05:52,900 --> 05:05:55,991
so there will be a point
at which spark reduces

7039
05:05:55,991 --> 05:05:57,134
the cost per unit

7040
05:05:57,134 --> 05:06:01,100
of the computation even with
the additional RAM requirement.

7041
05:06:01,200 --> 05:06:04,500
There are two types of
data processing: batch processing

7042
05:06:04,500 --> 05:06:08,344
and stream processing. Batch
processing has been crucial

7043
05:06:08,344 --> 05:06:09,904
to the big data world.

7044
05:06:09,904 --> 05:06:13,100
In simplest terms, batch
processing is working

7045
05:06:13,100 --> 05:06:16,500
with high data volumes
collected over a period.

7046
05:06:16,500 --> 05:06:20,423
In batch processing, data is
first collected, then processed,

7047
05:06:20,423 --> 05:06:21,800
and then the results

7048
05:06:21,800 --> 05:06:24,624
are produced at a later
stage. Batch processing

7049
05:06:24,624 --> 05:06:26,000
is an efficient way

7050
05:06:26,000 --> 05:06:28,769
of processing large
static data sets.

7051
05:06:28,800 --> 05:06:30,300
Generally we perform

7052
05:06:30,300 --> 05:06:34,300
batch processing for archived
data sets, for example,

7053
05:06:34,300 --> 05:06:36,887
calculating average income
of a country

7054
05:06:36,887 --> 05:06:40,700
or evaluating the change
in e-commerce in the last decade.
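
To contrast the two styles on the average income example, here is a small plain-Python sketch with invented figures. It is not Spark code. The batch version waits for the whole archived data set, while the running version updates its answer one record at a time, which is the idea behind the stream processing described next.

```python
# Hypothetical archived income records.
incomes = [30_000, 45_000, 52_000, 61_000]


def batch_average(records):
    """Batch style: collect everything first, then compute once."""
    return sum(records) / len(records)


class RunningAverage:
    """Stream style: update the result as each record arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


batch_result = batch_average(incomes)
stream = RunningAverage()
stream_results = [stream.update(value) for value in incomes]
print(batch_result, stream_results[-1])
```

Both approaches agree once all records are seen; the difference is that the running version has a usable answer after every record.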

7055
05:06:40,900 --> 05:06:45,000
Now, stream processing. Stream
processing is the current trend

7056
05:06:45,000 --> 05:06:48,258
in the big data world. The need
of the hour is speed

7057
05:06:48,258 --> 05:06:50,100
and real-time information,

7058
05:06:50,100 --> 05:06:52,100
which is what stream processing

7059
05:06:52,100 --> 05:06:54,500
does. Batch processing
does not allow

7060
05:06:54,500 --> 05:06:57,700
businesses to quickly react
to changing business needs,

7061
05:06:57,700 --> 05:07:01,900
and real-time stream processing
has seen a rapid growth

7062
05:07:01,900 --> 05:07:05,188
in that demand. Now, coming
back to Apache Spark

7063
05:07:05,188 --> 05:07:09,420
versus Hadoop: YARN is basically
a batch processing framework.

7064
05:07:09,420 --> 05:07:11,500
When we submit a job to YARN,

7065
05:07:11,500 --> 05:07:14,827
it reads data from
the cluster, performs an operation,

7066
05:07:14,827 --> 05:07:17,539
and writes the results
back to the cluster,

7067
05:07:17,539 --> 05:07:19,100
and then it again reads

7068
05:07:19,100 --> 05:07:21,900
the updated data, performs
the next operation,

7069
05:07:21,900 --> 05:07:25,500
and writes the results back
to the cluster, and so on.

7070
05:07:25,700 --> 05:07:29,678
On the other hand, Spark is
designed to cover a wide range

7071
05:07:29,678 --> 05:07:31,100
of workloads such as

7072
05:07:31,100 --> 05:07:35,429
batch applications, iterative
algorithms, interactive queries,

7073
05:07:35,429 --> 05:07:37,100
and streaming as well.

7074
05:07:37,400 --> 05:07:40,899
Now, let's come to fault
tolerance. Hadoop and Spark

7075
05:07:40,899 --> 05:07:43,000
both provide fault tolerance,

7076
05:07:43,000 --> 05:07:45,716
but have different
approaches. For HDFS

7077
05:07:45,716 --> 05:07:47,673
and YARN, both master daemons,

7078
05:07:47,673 --> 05:07:49,700
that is, the NameNode in HDFS

7079
05:07:49,700 --> 05:07:53,285
and the ResourceManager
in YARN, check the heartbeat

7080
05:07:53,285 --> 05:07:54,651
of the slave daemons.

7081
05:07:54,651 --> 05:07:58,000
The slave daemons are DataNodes
and NodeManagers.

7082
05:07:58,000 --> 05:08:00,100
So if any slave daemon fails,

7083
05:08:00,100 --> 05:08:03,800
the master daemons reschedule
all pending and in-progress

7084
05:08:03,800 --> 05:08:07,900
operations to another slave.
Now, this method is effective,

7085
05:08:07,900 --> 05:08:11,300
but it can significantly
increase the completion time

7086
05:08:11,300 --> 05:08:14,000
for operations with
a single failure also.

7087
05:08:14,000 --> 05:08:16,400
And as Hadoop uses
commodity hardware,

7088
05:08:16,400 --> 05:08:20,200
another way in which HDFS
ensures fault tolerance is

7089
05:08:20,200 --> 05:08:21,797
by replicating data.

7090
05:08:22,200 --> 05:08:24,200
Now let's talk about Spark.

7091
05:08:24,200 --> 05:08:29,094
As we discussed earlier, RDDs, or
resilient distributed data sets,

7092
05:08:29,094 --> 05:08:31,710
are the building blocks
of Apache Spark,

7093
05:08:32,000 --> 05:08:34,100
and RDDs are the ones

7094
05:08:34,226 --> 05:08:37,073
which provide fault
tolerance to Spark.

7095
05:08:37,073 --> 05:08:38,000
They can refer

7096
05:08:38,000 --> 05:08:41,600
to any data set present
in an external storage system

7097
05:08:41,600 --> 05:08:45,200
like HDFS, HBase,
a shared file system, etc.

7098
05:08:45,300 --> 05:08:47,100
They can also be operated on

7099
05:08:47,100 --> 05:08:49,869
in parallel. RDDs can
persist a data set

7100
05:08:49,869 --> 05:08:52,100
in memory across operations,

7101
05:08:52,100 --> 05:08:56,061
which makes future actions
10 times faster.

7102
05:08:56,061 --> 05:08:58,731
If an RDD is lost,
it will automatically

7103
05:08:58,731 --> 05:09:02,700
get recomputed by using
the original transformations.
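
The recompute-from-transformations idea can be sketched with a toy class in plain Python. This is only an illustration under my own naming, not Spark's RDD implementation. The data set remembers its source and the functions applied to it, so a lost result can be rebuilt by replaying them.

```python
class LineageDataset:
    """Toy stand-in for an RDD: it remembers how it was derived."""

    def __init__(self, source, transforms=()):
        self.source = source
        self.transforms = tuple(transforms)
        self._cached = None  # materialized result, may be lost

    def map(self, fn):
        # Record the transformation instead of running it right away.
        return LineageDataset(self.source, self.transforms + (fn,))

    def collect(self):
        if self._cached is None:
            data = list(self.source)
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self._cached = data
        return self._cached

    def lose_partition(self):
        # Simulate losing the in-memory copy (for example, a node failure).
        self._cached = None


doubled = LineageDataset([1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
first = doubled.collect()
doubled.lose_partition()
recomputed = doubled.collect()  # rebuilt by replaying the recorded steps
print(first, recomputed)
```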

7104
05:09:02,700 --> 05:09:06,720
And this is how Spark provides
fault tolerance. And at the end,

7105
05:09:06,720 --> 05:09:08,500
let us talk about security.

7106
05:09:08,500 --> 05:09:11,100
Well Hadoop has
multiple ways of providing

7107
05:09:11,100 --> 05:09:14,806
security. Hadoop supports
Kerberos for authentication,

7108
05:09:14,806 --> 05:09:17,800
but it is difficult
to handle. Nevertheless,

7109
05:09:17,800 --> 05:09:21,800
it also supports
third-party vendors like LDAP

7110
05:09:22,000 --> 05:09:23,441
for authentication.

7111
05:09:23,441 --> 05:09:26,400
They also offer
encryption. HDFS supports

7112
05:09:26,400 --> 05:09:30,600
traditional file permissions as
well as access control lists.

7113
05:09:30,600 --> 05:09:34,222
Hadoop provides service-level
authorization, which guarantees

7114
05:09:34,222 --> 05:09:36,800
that clients have
the right permissions for

7115
05:09:36,800 --> 05:09:40,400
job submission. Spark currently
supports authentication

7116
05:09:40,400 --> 05:09:44,600
via a shared secret. Spark
can integrate with HDFS,

7117
05:09:44,600 --> 05:09:46,900
and it can use HDFS ACLs,

7118
05:09:46,900 --> 05:09:50,652
or access control lists,
and file-level permissions.

7119
05:09:50,652 --> 05:09:52,024
Spark can also run on

7120
05:09:52,024 --> 05:09:55,100
YARN, leveraging the
capability of Kerberos.

7121
05:09:55,100 --> 05:09:55,900
Now,

7122
05:09:55,900 --> 05:09:59,100
this was the comparison
of these two frameworks based

7123
05:09:59,100 --> 05:10:00,600
on the following parameters.

7124
05:10:00,600 --> 05:10:03,300
Now, let us understand use cases

7125
05:10:03,300 --> 05:10:06,900
where these technologies
fit best. Use cases where

7126
05:10:06,900 --> 05:10:07,900
Hadoop fits best:

7127
05:10:07,900 --> 05:10:09,300
For example,

7128
05:10:09,300 --> 05:10:12,500
when you're analyzing
archive data. YARN

7129
05:10:12,500 --> 05:10:14,300
allows parallel processing

7130
05:10:14,300 --> 05:10:18,657
over huge amounts of data. Parts
of the data are processed in parallel

7131
05:10:18,657 --> 05:10:21,300
and separately on
different data nodes

7132
05:10:21,300 --> 05:10:25,825
and YARN gathers the result
from each NodeManager. In cases

7133
05:10:25,825 --> 05:10:29,000
when instant results
are not required,

7134
05:10:29,000 --> 05:10:32,319
Hadoop MapReduce is a good
and economical solution

7135
05:10:32,319 --> 05:10:33,700
for batch processing.

7136
05:10:33,700 --> 05:10:35,546
However, it is incapable

7137
05:10:35,900 --> 05:10:39,015
of processing data
in real time. Use cases

7138
05:10:39,015 --> 05:10:43,400
where Spark fits best:
real-time big data analysis.

7139
05:10:43,400 --> 05:10:46,600
Real-time data analysis
means processing data

7140
05:10:46,600 --> 05:10:50,300
that is getting generated by
the real-time event streams

7141
05:10:50,300 --> 05:10:53,000
coming in at the rate
of billions of events

7142
05:10:53,000 --> 05:10:55,000
per second. The strength

7143
05:10:55,000 --> 05:10:58,277
of Spark lies in its ability
to support streaming

7144
05:10:58,277 --> 05:11:00,900
of data along with
distributed processing

7145
05:11:00,900 --> 05:11:04,700
and Spark claims to process
data a hundred times faster

7146
05:11:04,700 --> 05:11:09,100
than MapReduce, and 10 times
faster on disk.

7147
05:11:09,100 --> 05:11:13,000
It is used in graph
processing. Spark contains

7148
05:11:13,000 --> 05:11:15,720
a graph computation
library called GraphX,

7149
05:11:15,720 --> 05:11:18,700
which simplifies our life.
In-memory computation,

7150
05:11:18,700 --> 05:11:22,100
along with inbuilt graph support,
improves the

7151
05:11:22,100 --> 05:11:24,700
performance of algorithms
by a magnitude

7152
05:11:24,700 --> 05:11:28,516
of one or two degrees over
traditional MapReduce programs.

7153
05:11:28,516 --> 05:11:32,200
It is also used in iterative
machine learning algorithms.

7154
05:11:32,200 --> 05:11:35,900
Almost all machine learning
algorithms work iteratively.

7155
05:11:35,900 --> 05:11:39,039
As we have seen earlier,
iterative algorithms

7156
05:11:39,039 --> 05:11:41,389
involve input/output bottlenecks

7157
05:11:41,389 --> 05:11:44,400
in the MapReduce
implementations. MapReduce

7158
05:11:44,400 --> 05:11:46,400
uses coarse-grained tasks

7159
05:11:46,400 --> 05:11:47,600
that are too heavy

7160
05:11:47,600 --> 05:11:51,926
for iterative algorithms. Spark
caches the intermediate data

7161
05:11:51,926 --> 05:11:53,972
set after each iteration

7162
05:11:53,972 --> 05:11:57,586
and runs multiple iterations
on the cache data set

7163
05:11:57,586 --> 05:12:01,200
which eventually reduces
the input/output overhead

7164
05:12:01,200 --> 05:12:03,142
and executes the algorithm

7165
05:12:03,142 --> 05:12:07,400
faster in a fault-tolerant
manner. So, in the end, which one is

7166
05:12:07,400 --> 05:12:10,900
the best? The answer
to this is that Hadoop

7167
05:12:10,900 --> 05:12:14,800
and Apache Spark are
not competing with one another.

7168
05:12:15,000 --> 05:12:18,100
In fact, they complement
each other quite well,

7169
05:12:18,100 --> 05:12:20,745
Hadoop brings huge
data sets under control

7170
05:12:20,745 --> 05:12:22,100
on commodity systems,

7171
05:12:22,100 --> 05:12:26,100
and Spark provides
real-time in-memory processing

7172
05:12:26,100 --> 05:12:27,700
for those data sets.

7173
05:12:27,900 --> 05:12:30,600
When we combine
Apache Spark's abilities,

7174
05:12:30,600 --> 05:12:34,200
that is, the high processing
speed and advanced analytics

7175
05:12:34,200 --> 05:12:38,600
and multiple integration support,
with Hadoop's low-cost operation

7176
05:12:38,600 --> 05:12:40,200
on commodity hardware,

7177
05:12:40,200 --> 05:12:42,091
it gives the best results.

7178
05:12:42,091 --> 05:12:45,800
Hadoop complements Apache
Spark's capabilities. Spark

7179
05:12:45,800 --> 05:12:48,737
cannot completely replace Hadoop,
but the good news is

7180
05:12:48,737 --> 05:12:52,079
that the demand for Spark is
currently at an all-time

7181
05:12:52,079 --> 05:12:55,849
high. If you want to learn more
about the Hadoop ecosystem tools

7182
05:12:55,849 --> 05:12:56,900
and Apache Spark,

7183
05:12:56,900 --> 05:12:59,106
don't forget to take
a look at the Edureka

7184
05:12:59,106 --> 05:13:01,700
YouTube channel
and check out the Big Data

7185
05:13:01,700 --> 05:13:03,000
and Hadoop playlist.

7186
05:13:07,600 --> 05:13:09,776
Welcome, everyone, to
today's session on

7187
05:13:09,776 --> 05:13:11,100
Kafka Spark Streaming.

7188
05:13:11,100 --> 05:13:14,400
So without any further delay,
let's look at the agenda first.

7189
05:13:14,400 --> 05:13:16,128
We will start by understanding

7190
05:13:16,128 --> 05:13:17,310
what Apache Kafka is.

7191
05:13:17,310 --> 05:13:19,900
Then we will discuss
the different components

7192
05:13:19,900 --> 05:13:22,000
of Apache Kafka
and its architecture.

7193
05:13:22,000 --> 05:13:24,899
Further, we will look
at different Kafka commands.

7194
05:13:24,899 --> 05:13:25,546
After that,

7195
05:13:25,546 --> 05:13:27,994
we'll take a brief overview
of Apache Spark

7196
05:13:27,994 --> 05:13:30,700
and understand
different Spark components.

7197
05:13:30,700 --> 05:13:31,201
Finally,

7198
05:13:31,201 --> 05:13:32,579
we'll look at the demo

7199
05:13:32,579 --> 05:13:35,900
where we will use Spark
Streaming with Apache Kafka.

7200
05:13:36,100 --> 05:13:37,600
Let's move to our first slide.

7201
05:13:37,900 --> 05:13:39,323
So in a real-time scenario,

7202
05:13:39,323 --> 05:13:41,500
we have different
systems or services,

7203
05:13:41,500 --> 05:13:43,000
which will be communicating

7204
05:13:43,000 --> 05:13:46,200
with each other and
the data pipelines are the ones

7205
05:13:46,200 --> 05:13:48,800
which are establishing
connections between two servers

7206
05:13:48,800 --> 05:13:49,953
or two systems.

7207
05:13:50,000 --> 05:13:52,100
Now, let's take
an example of an e-commerce

7208
05:13:52,100 --> 05:13:55,255
website, where it can have
multiple servers at the front end,

7209
05:13:55,255 --> 05:13:58,161
like a web or application server
for hosting the application.

7210
05:13:58,161 --> 05:13:59,530
It can have a chat server

7211
05:13:59,530 --> 05:14:01,958
for the customers
to provide chat facilities.

7212
05:14:01,958 --> 05:14:04,900
Then it can have a separate
server for payments, etc.

7213
05:14:04,900 --> 05:14:08,145
Similarly, an organization can also
have multiple servers

7214
05:14:08,145 --> 05:14:09,100
at the back end

7215
05:14:09,100 --> 05:14:11,900
which will be receiving messages
from different front end servers

7216
05:14:11,900 --> 05:14:13,200
based on the requirements.

7217
05:14:13,400 --> 05:14:15,600
Now they can have
a database server

7218
05:14:15,600 --> 05:14:17,700
which will be storing
the records. Then they

7219
05:14:17,700 --> 05:14:20,100
can have security systems
for user authentication

7220
05:14:20,100 --> 05:14:21,916
and authorization. Then
they can have

7221
05:14:21,916 --> 05:14:23,368
a real-time monitoring server,

7222
05:14:23,368 --> 05:14:25,600
which is basically
used for recommendations.

7223
05:14:25,600 --> 05:14:28,100
So all these data
pipelines become complex

7224
05:14:28,100 --> 05:14:30,200
with the increase
in the number of systems,

7225
05:14:30,200 --> 05:14:31,594
and adding a new system

7226
05:14:31,594 --> 05:14:33,900
or server requires
more data pipelines,

7227
05:14:33,900 --> 05:14:35,900
which will again
make the data flow

7228
05:14:35,900 --> 05:14:37,800
more complicated and complex.

7229
05:14:37,800 --> 05:14:38,662
Now, managing

7230
05:14:38,662 --> 05:14:41,646
these data pipelines also
becomes very difficult,

7231
05:14:41,646 --> 05:14:45,100
as each data pipeline has
its own set of requirements.

7232
05:14:45,100 --> 05:14:46,700
For example, data pipelines

7233
05:14:46,700 --> 05:14:49,700
that handle transactions
should be more fault tolerant

7234
05:14:49,700 --> 05:14:51,700
and robust. On the other hand,

7235
05:14:51,700 --> 05:14:54,372
a clickstream data pipeline
can be more fragile.

7236
05:14:54,372 --> 05:14:55,784
So adding some pipelines

7237
05:14:55,784 --> 05:14:58,400
or removing some pipelines
becomes more difficult

7238
05:14:58,400 --> 05:14:59,600
from the complex system.

7239
05:14:59,800 --> 05:15:02,800
So now I hope that you would
have understood the problem

7240
05:15:02,800 --> 05:15:05,400
due to which messaging
systems originated.

7241
05:15:05,400 --> 05:15:08,200
Let's move to the next slide
and we'll understand

7242
05:15:08,200 --> 05:15:11,970
how Kafka solves this problem.
Now, a messaging system reduces

7243
05:15:11,970 --> 05:15:13,835
the complexity of data pipelines

7244
05:15:13,835 --> 05:15:16,600
and makes the communication
between systems

7245
05:15:16,600 --> 05:15:19,780
simpler and more manageable.
Using a messaging system,

7246
05:15:19,780 --> 05:15:22,500
you can now easily
establish remote communication

7247
05:15:22,500 --> 05:15:25,063
and send your data
easily across the network.

7248
05:15:25,063 --> 05:15:26,536
Now, different systems

7249
05:15:26,536 --> 05:15:29,100
may use different
platforms and languages

7250
05:15:29,200 --> 05:15:30,200
and messaging system

7251
05:15:30,200 --> 05:15:32,852
provides you a common
paradigm independent

7252
05:15:32,852 --> 05:15:34,560
of any platform or language.

7253
05:15:34,560 --> 05:15:36,900
So basically it
decouples the platform

7254
05:15:36,900 --> 05:15:39,800
on which your front-end server as
well as your back-end server

7255
05:15:39,800 --> 05:15:43,600
is running. You can also establish
asynchronous communication

7256
05:15:43,600 --> 05:15:44,800
and send messages

7257
05:15:44,800 --> 05:15:47,000
so that the sender
does not have to wait

7258
05:15:47,000 --> 05:15:49,000
for the receiver
to process the messages.

7259
05:15:49,200 --> 05:15:51,300
Now, one of the benefits
of a messaging system is

7260
05:15:51,300 --> 05:15:53,295
that you can have
reliable communication.

7261
05:15:53,295 --> 05:15:56,600
So even when the receiver or
network is not working properly,

7262
05:15:56,600 --> 05:15:59,272
your messages won't
get lost. Now, talking

7263
05:15:59,272 --> 05:16:02,900
about Kafka: Kafka
decouples the data pipelines

7264
05:16:02,900 --> 05:16:06,205
and solves the complexity
problem. The applications

7265
05:16:06,205 --> 05:16:10,050
which are producing messages
to Kafka are called producers

7266
05:16:10,050 --> 05:16:11,400
and the applications

7267
05:16:11,400 --> 05:16:13,600
which are consuming
those messages from Kafka

7268
05:16:13,600 --> 05:16:14,706
are called consumers.

7269
05:16:14,706 --> 05:16:17,500
Now, as you can see in the image,
the front-end server,

7270
05:16:17,500 --> 05:16:20,200
then your application server
1, application server

7271
05:16:20,200 --> 05:16:21,500
2, and chat server

7272
05:16:21,500 --> 05:16:25,500
are producing messages to Kafka,
and these are called producers.

7273
05:16:25,500 --> 05:16:26,985
And your database server,

7274
05:16:26,985 --> 05:16:29,594
security systems, real-time
monitoring server,

7275
05:16:29,594 --> 05:16:31,900
then other services
and data warehouse:

7276
05:16:31,900 --> 05:16:34,300
these are basically
consuming the messages

7277
05:16:34,300 --> 05:16:35,900
and are called consumers.

7278
05:16:36,100 --> 05:16:39,600
So your producer sends
the message to Kafka

7279
05:16:39,700 --> 05:16:42,781
and then Kafka
stores those messages, and consumers

7280
05:16:42,781 --> 05:16:45,000
who want those
messages can subscribe

7281
05:16:45,000 --> 05:16:47,607
and receive them. Now,
you can also have

7282
05:16:47,607 --> 05:16:51,191
multiple subscribers to
a single category of messages.

7283
05:16:51,191 --> 05:16:52,623
So your database server

7284
05:16:52,623 --> 05:16:56,400
and your security system can
be consuming the same messages

7285
05:16:56,400 --> 05:16:58,600
which are produced
by application server

7286
05:16:58,600 --> 05:17:01,423
1. And again, adding
a new consumer is very easy.

7287
05:17:01,423 --> 05:17:03,658
You can go ahead and
add a new consumer

7288
05:17:03,658 --> 05:17:06,268
and just subscribe
to the message categories

7289
05:17:06,268 --> 05:17:07,300
that are required.

7290
05:17:07,300 --> 05:17:10,700
So again, you can add
a new consumer, say consumer one,

7291
05:17:10,700 --> 05:17:13,100
and you can again
go ahead and subscribe

7292
05:17:13,100 --> 05:17:14,570
to the category of messages

7293
05:17:14,570 --> 05:17:17,100
which is produced by
application server one.

7294
05:17:17,100 --> 05:17:19,100
So, let's quickly move ahead.

7295
05:17:19,100 --> 05:17:21,606
Let's talk about
Apache Kafka. So Apache

7296
05:17:21,606 --> 05:17:24,853
Kafka is a distributed
publish/subscribe messaging

7297
05:17:24,853 --> 05:17:28,300
system. Messaging traditionally
has two models: queuing

7298
05:17:28,300 --> 05:17:32,173
and publish/subscribe. In a queue,
a pool of consumers

7299
05:17:32,173 --> 05:17:33,769
may read from a server

7300
05:17:33,769 --> 05:17:36,540
and each record only
goes to one of them,

7301
05:17:36,540 --> 05:17:38,600
whereas in publish/subscribe,

7302
05:17:38,600 --> 05:17:41,313
the record is broadcast
to all consumers.

7303
05:17:41,313 --> 05:17:43,722
So multiple consumers
can get the record.

7304
05:17:43,722 --> 05:17:45,700
The Kafka cluster is distributed

7305
05:17:45,700 --> 05:17:48,374
and has multiple machines
running in parallel.

7306
05:17:48,374 --> 05:17:50,700
And this is the reason
why Kafka is fast,

7307
05:17:50,700 --> 05:17:52,000
scalable, and fault tolerant.

7308
05:17:52,300 --> 05:17:53,309
Now let me tell you

7309
05:17:53,309 --> 05:17:55,700
that Kafka was developed
at LinkedIn, and later

7310
05:17:55,700 --> 05:17:57,700
it became a part
of the Apache project.

7311
05:17:57,900 --> 05:18:01,100
Now, let us look at some
of the important terminologies.

7312
05:18:01,100 --> 05:18:03,499
So we'll first start with topic.

7313
05:18:03,499 --> 05:18:05,081
So a topic is a category

7314
05:18:05,081 --> 05:18:08,100
or feed name to which
records are published,

7315
05:18:08,100 --> 05:18:11,226
and topics in Kafka are
always multi-subscriber.

7316
05:18:11,226 --> 05:18:14,800
That is, a topic can have
zero, one, or multiple consumers

7317
05:18:14,800 --> 05:18:16,600
that subscribe to the topic

7318
05:18:16,600 --> 05:18:19,300
and consume the data written
to it. For example,

7319
05:18:19,300 --> 05:18:21,850
you can have sales records
getting published in the sales

7320
05:18:21,850 --> 05:18:23,500
topic; you can
have product records,

7321
05:18:23,500 --> 05:18:25,600
which are getting published
in the product topic,

7322
05:18:25,600 --> 05:18:28,965
and so on. This will actually
segregate your messages,

7323
05:18:28,965 --> 05:18:31,756
and consumers will only
subscribe to the topics

7324
05:18:31,756 --> 05:18:35,500
that they need. And again, your
consumer can also subscribe

7325
05:18:35,500 --> 05:18:37,300
to two or more topics.

7326
05:18:37,300 --> 05:18:40,100
Now, let's talk
about partitions.

7327
05:18:40,100 --> 05:18:44,253
So Kafka topics are divided
into a number of partitions

7328
05:18:44,253 --> 05:18:47,800
and partitions allow
you to parallelize a topic

7329
05:18:47,800 --> 05:18:49,284
by splitting the data

7330
05:18:49,284 --> 05:18:51,846
in a particular
topic across multiple

7331
05:18:51,846 --> 05:18:55,200
brokers, which means
each partition can be placed

7332
05:18:55,200 --> 05:18:58,869
on a separate machine to allow
multiple consumers to read

7333
05:18:58,869 --> 05:19:00,500
from a topic in parallel.

7334
05:19:00,500 --> 05:19:02,700
So in case of the sales
topic, you can have

7335
05:19:02,700 --> 05:19:05,700
three partitions: partition
0, partition 1, and partition

7336
05:19:05,700 --> 05:19:09,400
2, from where three consumers
can read data in parallel.

7337
05:19:09,400 --> 05:19:10,481
Now, moving ahead,

7338
05:19:10,481 --> 05:19:12,200
let's talk about producers.

7339
05:19:12,200 --> 05:19:13,845
So producers are the ones

7340
05:19:13,845 --> 05:19:17,000
who publish the data
to topics of their choice.

7341
05:19:17,000 --> 05:19:18,600
Then you have consumers.

7342
05:19:18,600 --> 05:19:21,786
Consumers can subscribe
to one or more topics

7343
05:19:21,786 --> 05:19:22,910
and consume data

7344
05:19:22,910 --> 05:19:26,773
from those topics. Now, consumers
basically label themselves

7345
05:19:26,773 --> 05:19:28,600
with a consumer group name

7346
05:19:28,600 --> 05:19:31,900
and each record published
to a topic is delivered

7347
05:19:31,900 --> 05:19:35,703
to one consumer instance within
each subscribing consumer group.

7348
05:19:35,703 --> 05:19:37,536
So suppose you have
a consumer group.

7349
05:19:37,536 --> 05:19:40,072
Let's say consumer group
1, and then you have

7350
05:19:40,072 --> 05:19:41,900
three consumers residing in it.

7351
05:19:41,900 --> 05:19:45,400
That is, consumer A, consumer B,
and consumer C. Now,

7352
05:19:45,400 --> 05:19:47,015
from the sales topic,

7353
05:19:47,100 --> 05:19:51,600
each record can be read once
by consumer group 1, and it

7354
05:19:51,600 --> 05:19:56,200
can either be read by consumer A
or consumer B or consumer C,

7355
05:19:56,200 --> 05:20:00,337
but it can only be consumed once
by the single consumer group,

7356
05:20:00,337 --> 05:20:02,200
that is, consumer group one.

7357
05:20:02,200 --> 05:20:05,700
But again, you can have
multiple consumer groups

7358
05:20:05,700 --> 05:20:07,700
which can subscribe to a topic

7359
05:20:07,700 --> 05:20:11,260
where one record can be consumed
by multiple consumers.

7360
05:20:11,260 --> 05:20:14,226
That is, one consumer
from each consumer group.

7361
05:20:14,226 --> 05:20:16,842
So now let's say
you have consumer group 1

7362
05:20:16,842 --> 05:20:19,291
and consumer group
2. In consumer group

7363
05:20:19,291 --> 05:20:20,600
1 we have two consumers,

7364
05:20:20,600 --> 05:20:22,854
that is, consumer 1A
and consumer 1B,

7365
05:20:22,854 --> 05:20:24,400
and in consumer group 2 we

7366
05:20:24,400 --> 05:20:27,819
have two consumers, consumer 2A
and consumer 2B. So

7367
05:20:27,819 --> 05:20:30,229
if consumer group
1 and consumer group

7368
05:20:30,229 --> 05:20:32,900
2 are consuming messages
from topic sales,

7369
05:20:32,900 --> 05:20:36,000
a single record will be
consumed by consumer group one

7370
05:20:36,000 --> 05:20:39,111
as well as consumer group
2, and a single consumer

7371
05:20:39,111 --> 05:20:43,000
from each consumer group
will consume the record once. So,

7372
05:20:43,000 --> 05:20:45,900
I guess you are clear
with the concept of consumer

7373
05:20:45,900 --> 05:20:49,124
and consumer group. Now,
consumer instances can be

7374
05:20:49,124 --> 05:20:51,800
separate processes
or separate machines.

7375
05:20:51,900 --> 05:20:55,918
Now, talking about brokers: a broker
is nothing but a single machine

7376
05:20:55,918 --> 05:20:57,300
in the Kafka cluster,

7377
05:20:57,300 --> 05:21:00,800
and ZooKeeper is another Apache
open source project.

7378
05:21:00,800 --> 05:21:03,536
It stores the metadata
information related

7379
05:21:03,536 --> 05:21:04,700
to the Kafka cluster,

7380
05:21:04,700 --> 05:21:08,100
like broker information,
topic details, etc.

7381
05:21:08,100 --> 05:21:09,933
ZooKeeper is basically the one

7382
05:21:09,933 --> 05:21:12,316
who is managing
the whole Kafka cluster.

7383
05:21:12,316 --> 05:21:14,700
Now, let's quickly go
to the next slide.

7384
05:21:14,700 --> 05:21:16,900
So suppose you have a topic.

7385
05:21:16,900 --> 05:21:21,100
Let's assume this is topic sales
and you have four partitions,

7386
05:21:21,100 --> 05:21:23,900
so you have partition
0, partition 1, partition

7387
05:21:23,900 --> 05:21:27,600
2, and partition 3. Now you
have five brokers over here.

7388
05:21:27,614 --> 05:21:30,768
Now, let's take the case
of partition 1. So

7389
05:21:30,850 --> 05:21:34,800
if the replication factor
is 3, it will have 3 copies,

7390
05:21:34,800 --> 05:21:37,100
which will reside
on different brokers.

7391
05:21:37,100 --> 05:21:40,121
So one replica is
on broker 2, the next is

7392
05:21:40,121 --> 05:21:43,000
on broker 3, and the next is
on broker 5. And

7393
05:21:43,000 --> 05:21:44,800
as you can see, replica 5,

7394
05:21:45,000 --> 05:21:47,800
so this 5 is from this broker 5.

7395
05:21:48,100 --> 05:21:52,500
So the ID of the replica
is the same as the ID of the broker

7396
05:21:52,500 --> 05:21:55,700
that hosts it. Now, moving ahead,

7397
05:21:55,700 --> 05:21:57,100
one of the replicas

7398
05:21:57,100 --> 05:22:00,800
of partition one will serve
as the leader replica.

7399
05:22:00,800 --> 05:22:02,074
So now the leader

7400
05:22:02,074 --> 05:22:06,200
of partition one is replica
five and any consumer coming

7401
05:22:06,200 --> 05:22:07,684
and consuming messages

7402
05:22:07,684 --> 05:22:10,944
from partition one will
be served by this replica.

7403
05:22:10,944 --> 05:22:14,635
And these two replicas are
basically for fault tolerance,

7404
05:22:14,635 --> 05:22:17,343
so that once your
broker 5 goes off

7405
05:22:17,343 --> 05:22:19,264
or your disk becomes corrupt,

7406
05:22:19,264 --> 05:22:21,115
your replica 3 or replica

7407
05:22:21,115 --> 05:22:24,100
2, one of them,
will again serve as a leader,

7408
05:22:24,100 --> 05:22:26,938
and this is basically
decided on the basis

7409
05:22:26,938 --> 05:22:28,600
of the most in-sync replica.

7410
05:22:28,600 --> 05:22:30,587
So the replica
which will be most

7411
05:22:30,587 --> 05:22:34,100
in sync with this replica
will become the next leader.

7412
05:22:34,100 --> 05:22:36,700
So similarly, this
partition 0 may reside

7413
05:22:36,700 --> 05:22:40,400
on broker 1, broker 2,
and broker 3. Again,

7414
05:22:40,400 --> 05:22:44,500
your partition 2 may
reside on broker 4, broker

7415
05:22:44,500 --> 05:22:46,800
5, and say broker 1,

7416
05:22:46,900 --> 05:22:49,500
and then your third
partition might reside

7417
05:22:49,500 --> 05:22:51,500
on these three brokers.

7418
05:22:51,700 --> 05:22:54,900
So suppose that this is
the leader for partition

7419
05:22:54,900 --> 05:22:56,378
2, this is the leader

7420
05:22:56,378 --> 05:22:59,900
for partition 0, this is
the leader for partition 3,

7421
05:22:59,900 --> 05:23:02,900
and this is the leader
for partition 1. Right,

7422
05:23:02,900 --> 05:23:03,600
so you can see

7423
05:23:03,600 --> 05:23:08,300
that four consumers can consume
data in parallel from these brokers.

7424
05:23:08,300 --> 05:23:10,798
So this one can consume
data from partition

7425
05:23:10,798 --> 05:23:14,200
2, this consumer can consume
data from partition 0,

7426
05:23:14,200 --> 05:23:17,800
and similarly for partition
3 and partition 1.

7427
05:23:18,100 --> 05:23:21,500
Now, maintaining
the replicas basically helps

7428
05:23:21,500 --> 05:23:25,433
in fault tolerance, and keeping
different partition leaders

7429
05:23:25,433 --> 05:23:29,300
on different brokers basically
helps in parallel execution,

7430
05:23:29,300 --> 05:23:32,300
or you can say, in consuming
those messages in parallel.

7431
05:23:32,300 --> 05:23:34,391
So I hope that you
guys are clear

7432
05:23:34,391 --> 05:23:36,972
about topics, partitions,
and replicas. Now,

7433
05:23:36,972 --> 05:23:38,803
let's move to our next slide.

7434
05:23:38,803 --> 05:23:42,062
So this is what the whole
Kafka cluster looks like. You

7435
05:23:42,062 --> 05:23:43,567
have multiple producers,

7436
05:23:43,567 --> 05:23:46,200
which are again producing
messages to Kafka.

7437
05:23:46,200 --> 05:23:48,600
Then this whole is
the Kafka cluster

7438
05:23:48,600 --> 05:23:51,590
where you have two nodes. Node
one has two brokers,

7439
05:23:51,590 --> 05:23:55,128
broker 1 and broker 2,
and node two has two brokers,

7440
05:23:55,128 --> 05:23:58,600
which are broker 3 and broker
4. Again, consumers

7441
05:23:58,600 --> 05:24:01,434
will be consuming data
from these brokers,

7442
05:24:01,434 --> 05:24:03,135
and ZooKeeper is the one

7443
05:24:03,135 --> 05:24:05,900
who is managing
this whole Kafka cluster.

7444
05:24:06,200 --> 05:24:07,100
Now, let's look

7445
05:24:07,100 --> 05:24:10,688
at some basic commands of Kafka
and understand how Kafka works:

7446
05:24:10,688 --> 05:24:12,500
how to go ahead
and start ZooKeeper,

7447
05:24:12,500 --> 05:24:14,708
how to go ahead
and start the Kafka server,

7448
05:24:14,708 --> 05:24:16,200
and how to again go ahead

7449
05:24:16,200 --> 05:24:19,141
and produce some messages
to Kafka and then consume

7450
05:24:19,141 --> 05:24:20,600
some messages from Kafka.

7451
05:24:20,600 --> 05:24:21,800
So let me quickly switch

7452
05:24:21,800 --> 05:24:27,200
to my VM. So let me
quickly open the terminal.

7453
05:24:28,600 --> 05:24:31,400
Let me quickly go ahead
and execute sudo jps

7454
05:24:31,400 --> 05:24:33,180
so that I can check
all the daemons

7455
05:24:33,180 --> 05:24:34,800
that are running in my system.

7456
05:24:35,400 --> 05:24:37,095
So you can see I have NameNode,

7457
05:24:37,095 --> 05:24:40,800
DataNode, ResourceManager,
NodeManager, JobHistoryServer.

7458
05:24:42,000 --> 05:24:43,933
So now, as all the HDFS daemons

7459
05:24:43,933 --> 05:24:46,200
are running, let us
quickly go ahead

7460
05:24:46,200 --> 05:24:48,100
and start Kafka services.

7461
05:24:48,100 --> 05:24:50,561
So first I will go
to Kafka home.

7462
05:24:51,400 --> 05:24:53,800
So let me show
you the directory.

7463
05:24:53,800 --> 05:24:56,200
So my Kafka is in /usr/lib.

7464
05:24:56,600 --> 05:24:56,900
Now,

7465
05:24:56,900 --> 05:25:00,088
let me quickly go ahead
and start the ZooKeeper service.

7466
05:25:00,088 --> 05:25:01,087
But before that,

7467
05:25:01,087 --> 05:25:03,900
let me show you the
zookeeper.properties file.

7468
05:25:06,415 --> 05:25:10,800
So the clientPort is 2181, so
my ZooKeeper will be running

7469
05:25:10,800 --> 05:25:12,300
on port 2181,

7470
05:25:12,600 --> 05:25:15,400
and the data directory
in which my ZooKeeper

7471
05:25:15,400 --> 05:25:19,700
will store all the metadata
is /tmp/zookeeper.

7472
05:25:20,000 --> 05:25:23,200
So let us quickly go ahead
and start ZooKeeper,

7473
05:25:23,400 --> 05:25:28,300
and the command is
bin/zookeeper-server-start.sh.

7474
05:25:28,900 --> 05:25:30,500
So this is the script file

7475
05:25:30,500 --> 05:25:33,300
and then I'll pass
the properties file

7476
05:25:33,357 --> 05:25:37,988
which is inside the config directory.
Meanwhile,

7477
05:25:37,988 --> 05:25:39,834
let me open another tab.

7478
05:25:40,403 --> 05:25:44,096
So here I will be starting
my first Kafka broker.

7479
05:25:44,200 --> 05:25:47,200
But before that let me show
you the properties file.

7480
05:25:47,576 --> 05:25:50,423
So we'll go
into the config directory again,

7481
05:25:51,100 --> 05:25:53,700
and I have
server.properties.

7482
05:25:54,400 --> 05:25:58,300
So these are the properties
of my first Kafka broker.

7483
05:25:59,507 --> 05:26:01,892
So first we have Server Basics.

7484
05:26:02,300 --> 05:26:06,400
So here the broker ID
of my first broker is 0. Then

7485
05:26:06,400 --> 05:26:10,700
the port is 9092, on which
my first broker will be running.

7486
05:26:11,400 --> 05:26:14,500
So it contains all
the socket server settings.

7487
05:26:14,657 --> 05:26:16,042
Then, moving ahead,

7488
05:26:16,049 --> 05:26:17,555
we have Log Basics.

7489
05:26:17,555 --> 05:26:21,139
So in Log Basics,
this is the log directory,

7490
05:26:21,200 --> 05:26:23,500
which is /tmp/kafka-

7491
05:26:23,500 --> 05:26:26,400
logs. So over here
my Kafka will store

7492
05:26:26,400 --> 05:26:28,226
all those messages or records,

7493
05:26:28,226 --> 05:26:30,600
which will be produced
by the producers.

7494
05:26:30,600 --> 05:26:31,799
So all the records

7495
05:26:31,799 --> 05:26:35,600
which belong to broker 0
will be stored at this location.

7496
05:26:35,900 --> 05:26:39,200
Now, the next section is
the internal topic settings,

7497
05:26:39,200 --> 05:26:40,900
in which the offsets topic

7498
05:26:40,900 --> 05:26:42,500
replication factor is 1,

7499
05:26:42,500 --> 05:26:48,100
then the transaction state log
replication factor is 1. Next

7500
05:26:48,384 --> 05:26:50,615
we have the log retention policy.

7501
05:26:50,900 --> 05:26:54,500
So the log retention
hours is 168.

7502
05:26:54,500 --> 05:26:58,319
So your records will be stored
for 168 hours by default

7503
05:26:58,319 --> 05:27:00,300
and then it will be deleted.

7504
05:27:00,300 --> 05:27:02,300
Then you have
zookeeper properties

7505
05:27:02,300 --> 05:27:05,100
where we have specified
zookeeper.connect, and

7506
05:27:05,100 --> 05:27:07,482
as we have seen
in the zookeeper.properties file

7507
05:27:07,482 --> 05:27:10,000
that our ZooKeeper
will be running on port 2181,

7508
05:27:10,000 --> 05:27:12,000
so we are giving
the address of ZooKeeper,

7509
05:27:12,000 --> 05:27:13,900
that is,
localhost:2181,

7510
05:27:14,300 --> 05:27:15,911
and at last we have the group

7511
05:27:15,911 --> 05:27:18,700
coordinator settings. So
let us quickly go ahead

7512
05:27:18,700 --> 05:27:20,700
and start the first broker.

7513
05:27:21,457 --> 05:27:24,842
So the script file is
kafka-server-start.sh,

7514
05:27:24,900 --> 05:27:27,100
and then we have to give
the properties file,

7515
05:27:27,200 --> 05:27:31,000
which is server.properties
for the first broker.

7516
05:27:31,200 --> 05:27:35,276
I'll hit Enter, and meanwhile,
let me open another tab.

7517
05:27:36,234 --> 05:27:39,865
Now I'll show you
the next properties file,

7518
05:27:40,200 --> 05:27:43,400
which is the server 1

7519
05:27:43,400 --> 05:27:44,600
properties file.

7520
05:27:45,300 --> 05:27:46,400
So the things

7521
05:27:46,400 --> 05:27:50,700
which you have to change
for creating a new broker

7522
05:27:51,000 --> 05:27:54,700
is first you have
to change the broker ID.

7523
05:27:54,900 --> 05:27:59,100
So my earlier broker ID was 0;
the new broker ID is 1. Again,

7524
05:27:59,100 --> 05:28:02,255
you can replicate this file,
and for a new server,

7525
05:28:02,255 --> 05:28:05,059
you have to change
the broker ID to 2. Then

7526
05:28:05,059 --> 05:28:08,513
you have to change the port,
because on 9092 already

7527
05:28:08,513 --> 05:28:11,200
my first broker is running,
that is, broker 0,

7528
05:28:11,200 --> 05:28:12,019
so my broker

7529
05:28:12,019 --> 05:28:14,099
should connect to
a different port,

7530
05:28:14,099 --> 05:28:17,000
and here I have specified
9093.

7531
05:28:17,700 --> 05:28:21,600
The next thing you have
to change is the log directory.

7532
05:28:21,600 --> 05:28:25,830
So here I have added a -1
to the default log directory.

7533
05:28:25,830 --> 05:28:27,400
So all these records

7534
05:28:27,400 --> 05:28:30,600
which are stored to my broker
1 will be going

7535
05:28:30,600 --> 05:28:32,505
to this particular directory,

7536
05:28:32,505 --> 05:28:35,500
that is,
/tmp/kafka-logs-1.

7537
05:28:35,500 --> 05:28:39,500
And the rest of the
things are similar,

7538
05:28:39,700 --> 05:28:42,900
so let me quickly go ahead
and start the second broker as well.

7539
05:28:45,800 --> 05:28:48,000
And let me open
one more terminal.

7540
05:28:51,569 --> 05:28:54,030
And I'll start
broker 2 as well.

7541
05:29:01,400 --> 05:29:06,475
So ZooKeeper is started, then
broker 1 is also started,

7542
05:29:06,475 --> 05:29:09,700
and this is broker
2, which is also started,

7543
05:29:09,702 --> 05:29:11,472
and this is broker 3.

7544
05:29:12,600 --> 05:29:14,600
So now let me
quickly minimize this

7545
05:29:15,200 --> 05:29:17,300
and I'll open a new terminal.

7546
05:29:18,000 --> 05:29:20,800
Now first, let us look
at some commands related

7547
05:29:20,800 --> 05:29:21,900
to Kafka topics.

7548
05:29:21,900 --> 05:29:24,900
So I'll quickly go ahead
and create a topic.

7549
05:29:25,250 --> 05:29:29,250
So again, let me first go
to my Kafka home directory.

7550
05:29:31,700 --> 05:29:36,000
Then the script file
is kafka-topics.sh,

7551
05:29:36,000 --> 05:29:37,762
then the first parameter

7552
05:29:37,762 --> 05:29:41,800
is --create. Then we have to give
the address of ZooKeeper,

7553
05:29:41,800 --> 05:29:43,327
because ZooKeeper is the one

7554
05:29:43,327 --> 05:29:46,000
who is actually containing
all the details related

7555
05:29:46,000 --> 05:29:47,000
to your topic.

7556
05:29:47,700 --> 05:29:50,600
So the address of my ZooKeeper
is localhost:2181;

7557
05:29:50,700 --> 05:29:53,000
then we'll give the topic name.

7558
05:29:53,000 --> 05:29:56,076
So let me name the topic
kafka-spark.

7559
05:29:56,076 --> 05:30:00,000
Next we have to specify
the replication factor

7560
05:30:00,000 --> 05:30:01,100
of the topic.

7561
05:30:01,300 --> 05:30:04,900
So it will replicate all
the partitions inside the topic

7562
05:30:04,900 --> 05:30:05,700
that many times.

7563
05:30:06,600 --> 05:30:08,300
So --replication-factor:

7564
05:30:08,300 --> 05:30:10,900
as we
have three brokers,

7565
05:30:10,900 --> 05:30:15,600
so let me keep it as 3
and then we have partitions.

7566
05:30:15,800 --> 05:30:17,074
So I will keep it as

7567
05:30:17,074 --> 05:30:19,746
three because we have
three Brokers running

7568
05:30:19,746 --> 05:30:21,689
and our consumer can go ahead

7569
05:30:21,689 --> 05:30:23,700
and consume messages in parallel

7570
05:30:23,700 --> 05:30:27,010
from three Brokers and
let me press enter.
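
The topic-creation command described above can be sketched as an argument list. This is a minimal illustration in Python, assuming the ZooKeeper-based CLI used in this demo; the helper name `kafka_topics_create` is hypothetical, and the `--zookeeper` flag was removed from `kafka-topics.sh` in Kafka 3.0, where `--bootstrap-server` is used instead.

```python
import shlex


def kafka_topics_create(topic, zookeeper, replication_factor, partitions,
                        script="./bin/kafka-topics.sh"):
    """Build the argument list for creating a topic (hypothetical helper)."""
    return [
        script, "--create",
        "--zookeeper", zookeeper,                       # where topic metadata is kept
        "--topic", topic,
        "--replication-factor", str(replication_factor),  # copies of each partition
        "--partitions", str(partitions),                  # units of parallelism
    ]


# Values taken from the walkthrough: three brokers, so 3 replicas and 3 partitions.
cmd = kafka_topics_create("kafka-spark", "localhost:2181", 3, 3)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd)` from the Kafka home directory; the printed string is what you would type in the terminal.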

7571
05:30:29,300 --> 05:30:32,000
So now you can see
the topic is created.

7572
05:30:32,000 --> 05:30:35,100
Now, let us quickly go ahead
and list all the topics.

7573
05:30:35,100 --> 05:30:36,100
So the command

7574
05:30:36,100 --> 05:30:40,200
for listing all the topics
is ./bin/ and, again,

7575
05:30:40,200 --> 05:30:44,200
the kafka-topics
script file, then

7576
05:30:44,200 --> 05:30:48,300
--list, and again we'll provide
the address of ZooKeeper.

7577
05:30:48,700 --> 05:30:50,000
So again, to list the topics,

7578
05:30:50,000 --> 05:30:53,674
we first have to go to
the kafka-topics script file.

7579
05:30:53,674 --> 05:30:55,200
Then we have to give the

7580
05:30:55,200 --> 05:30:59,300
--list parameter, and next we
have to give the ZooKeeper address,

7581
05:30:59,576 --> 05:31:02,423
which is localhost:2181.
I'll hit enter.

7582
05:31:04,100 --> 05:31:07,000
And you can see
I have this kafka-spark;

7583
05:31:07,000 --> 05:31:11,000
the kafka-spark
topic has been created.

7584
05:31:11,100 --> 05:31:11,407
Now.

7585
05:31:11,407 --> 05:31:14,176
Let me show you
one more thing again.

7586
05:31:14,176 --> 05:31:18,900
We'll go to
bin/kafka-topics.sh

7587
05:31:19,000 --> 05:31:21,100
and we'll describe this topic.

7588
05:31:21,900 --> 05:31:24,600
I will pass the address
of ZooKeeper,

7589
05:31:24,800 --> 05:31:26,300
which is localhost:2181,

7590
05:31:26,600 --> 05:31:30,600
and then
I'll pass the topic name,

7591
05:31:31,000 --> 05:31:34,700
which is kafka-spark.

7592
05:31:36,400 --> 05:31:37,600
So now you can see here.

7593
05:31:37,600 --> 05:31:40,100
The topic is kafka-spark.

7594
05:31:40,100 --> 05:31:43,400
The partition count is 3,
the replication factor is 3,

7595
05:31:43,400 --> 05:31:45,600
and the config is as follows.

7596
05:31:45,700 --> 05:31:49,900
So here you can see all the
three partitions of the topic

7597
05:31:49,900 --> 05:31:54,400
that is partition 0, partition 1,
and partition 2. Then the leader

7598
05:31:54,400 --> 05:31:57,400
for partition 0 is
broker 2, the leader

7599
05:31:57,400 --> 05:31:59,417
for partition 1 is broker 0,

7600
05:31:59,417 --> 05:32:02,200
and the leader for partition
2 is broker 1.

7601
05:32:02,200 --> 05:32:06,194
so you can see we have different
partition leaders residing on

7602
05:32:06,194 --> 05:32:09,600
different brokers, so this is
basically for load balancing.

7603
05:32:09,600 --> 05:32:11,900
So that different partitions
can be served

7604
05:32:11,900 --> 05:32:13,000
from different brokers

7605
05:32:13,000 --> 05:32:15,413
and can be
consumed in parallel. Again,

7606
05:32:15,413 --> 05:32:16,800
you can see the replicas

7607
05:32:16,800 --> 05:32:20,512
of this partition reside
on all three brokers; same

7608
05:32:20,512 --> 05:32:23,200
with partition 1 and same
with partition 2,

7609
05:32:23,200 --> 05:32:25,700
and it's showing you
the in-sync replicas.

7610
05:32:25,700 --> 05:32:27,100
So for the in-sync replicas,

7611
05:32:27,100 --> 05:32:30,600
the first is 2, then you have 0,
and then you have 1,

7612
05:32:30,600 --> 05:32:33,600
and similarly with
partitions 1 and 2.

7613
05:32:33,900 --> 05:32:35,100
So now let us quickly.

7614
05:32:35,100 --> 05:32:35,900
Go ahead.

7615
05:32:36,500 --> 05:32:38,346
I'll reduce this to half.

7616
05:32:40,000 --> 05:32:42,200
Let me open one more terminal.

7617
05:32:43,300 --> 05:32:45,200
The reason why I'm doing this is

7618
05:32:45,200 --> 05:32:48,600
that we can actually produce
a message from one console

7619
05:32:48,600 --> 05:32:51,700
and then we can receive
the message in another console.

7620
05:32:51,707 --> 05:32:56,092
So for that I'll start the Kafka
console producer first.

7621
05:32:56,396 --> 05:32:57,703
So the command is

7622
05:32:58,000 --> 05:33:04,400
./bin/kafka-console-producer.sh

7623
05:33:04,400 --> 05:33:06,100
and then in case

7624
05:33:06,100 --> 05:33:11,400
of producer you have to give
the parameter as --broker-list,

7625
05:33:11,800 --> 05:33:18,000
which is localhost:9092. You
can provide any of the brokers

7626
05:33:18,000 --> 05:33:19,000
that are running

7627
05:33:19,000 --> 05:33:22,400
and it will again take the rest
of the Brokers from there.

7628
05:33:22,400 --> 05:33:25,794
So you just have to provide
the address of one broker.

7629
05:33:25,794 --> 05:33:28,100
You can also provide
a set of Brokers

7630
05:33:28,100 --> 05:33:30,000
so you can give it
as localhost:9092,

7631
05:33:30,000 --> 05:33:33,800
localhost:9093,
and so on.

7632
05:33:33,800 --> 05:33:35,800
So here I am passing the address

7633
05:33:35,800 --> 05:33:39,700
of the first broker. Now next
I have to mention the topic.

7634
05:33:39,700 --> 05:33:41,900
So the topic is kafka-spark.

7635
05:33:43,700 --> 05:33:45,161
And I'll hit enter.

7636
05:33:45,500 --> 05:33:47,900
So my console
producer is started.

7637
05:33:47,900 --> 05:33:50,600
Let me produce
a message saying hi.

7638
05:33:51,000 --> 05:33:53,376
Now in the second terminal
I will go ahead

7639
05:33:53,376 --> 05:33:55,200
and start the console consumer.

7640
05:33:55,500 --> 05:34:00,700
So again, the command is
kafka-console-consumer.sh

7641
05:34:00,800 --> 05:34:03,000
and then in case of consumer,

7642
05:34:03,000 --> 05:34:06,600
you have to give the parameter
as --bootstrap-server.

7643
05:34:07,800 --> 05:34:10,400
So this is the thing
to notice guys that in case

7644
05:34:10,400 --> 05:34:13,600
of producer you have to give
the broker list, but in

7645
05:34:13,600 --> 05:34:14,725
case of consumer,

7646
05:34:14,725 --> 05:34:19,000
you have to give the bootstrap
server, and it is again the same,

7647
05:34:19,000 --> 05:34:23,389
that is localhost:9092, which is
the address of my broker 0,

7648
05:34:23,500 --> 05:34:25,807
and then I will give the topic

7649
05:34:25,807 --> 05:34:30,700
which is kafka-spark.
Now adding this parameter,

7650
05:34:30,700 --> 05:34:32,100
that is --from-beginning,

7651
05:34:32,100 --> 05:34:35,800
will basically
give me messages stored

7652
05:34:35,800 --> 05:34:37,926
in that topic from the beginning.

7653
05:34:37,926 --> 05:34:41,300
Otherwise, if I'm not giving
this parameter,

7654
05:34:41,300 --> 05:34:43,200
--from-beginning, I'll only

7655
05:34:43,200 --> 05:34:44,630
get the recent messages

7656
05:34:44,630 --> 05:34:48,300
that have been produced after
starting this console consumer.
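
The producer and consumer invocations described above can be sketched side by side. This is a minimal illustration in Python, assuming the script paths and flags spoken in this demo; the helper names are hypothetical, and flag names vary across Kafka versions, so check the release you run.

```python
import shlex

TOPIC = "kafka-spark"
BROKER = "localhost:9092"  # any one running broker is enough to find the rest


def console_producer(topic, broker_list,
                     script="./bin/kafka-console-producer.sh"):
    """Argument list for the console producer (takes --broker-list here)."""
    return [script, "--broker-list", broker_list, "--topic", topic]


def console_consumer(topic, bootstrap_server, from_beginning=False,
                     script="./bin/kafka-console-consumer.sh"):
    """Argument list for the console consumer (takes --bootstrap-server)."""
    cmd = [script, "--bootstrap-server", bootstrap_server, "--topic", topic]
    if from_beginning:
        # Without this flag only messages produced after startup are shown.
        cmd.append("--from-beginning")
    return cmd


producer_cmd = console_producer(TOPIC, BROKER)
consumer_cmd = console_consumer(TOPIC, BROKER, from_beginning=True)
print(shlex.join(producer_cmd))
print(shlex.join(consumer_cmd))
```

Run each printed command in its own terminal; lines typed into the producer should appear in the consumer, provided both use the same topic name.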

7657
05:34:48,300 --> 05:34:49,484
So let me hit enter

7658
05:34:49,484 --> 05:34:52,600
and you can see I'll get
a message saying hi first.

7659
05:34:55,700 --> 05:34:57,267
Well, I'm sorry guys.

7660
05:34:57,267 --> 05:35:00,400
The topic name I
have given is not correct.

7661
05:35:00,400 --> 05:35:01,784
Sorry for my typo.

7662
05:35:01,784 --> 05:35:03,707
Let me quickly correct it.

7663
05:35:04,300 --> 05:35:05,800
And let me hit enter.

7664
05:35:06,800 --> 05:35:10,300
So as you can see,
I am receiving the messages.

7665
05:35:10,300 --> 05:35:13,900
I received "hi". Then let
me produce some more messages.

7666
05:35:19,200 --> 05:35:21,600
So now you can see
all the messages

7667
05:35:21,600 --> 05:35:22,858
that I am producing

7668
05:35:22,858 --> 05:35:26,900
from the console producer are getting
consumed by the console consumer.

7669
05:35:26,900 --> 05:35:30,466
Now this console producer
as well as console consumer

7670
05:35:30,466 --> 05:35:31,838
are basically used by

7671
05:35:31,838 --> 05:35:35,200
the developers to actually
test the Kafka cluster.

7672
05:35:35,200 --> 05:35:37,100
So what happens is,

7673
05:35:37,100 --> 05:35:38,300
if there is a producer

7674
05:35:38,300 --> 05:35:40,300
which is running and
which is producing

7675
05:35:40,300 --> 05:35:43,196
those messages to Kafka
then you can go ahead

7676
05:35:43,196 --> 05:35:45,558
and you can start console
consumer and check

7677
05:35:45,558 --> 05:35:47,500
whether the producer
is producing

7678
05:35:47,500 --> 05:35:49,900
messages or not,
or you can again go ahead

7679
05:35:49,900 --> 05:35:50,900
and check the format

7680
05:35:50,900 --> 05:35:53,860
in which your messages are
getting produced to the topic.

7681
05:35:53,860 --> 05:35:56,988
That kind of testing
is done using the console consumer,

7682
05:35:56,988 --> 05:35:59,000
and similarly, using
the console producer,

7683
05:35:59,000 --> 05:36:01,500
suppose
you are creating a consumer:

7684
05:36:01,500 --> 05:36:04,900
you can go ahead and
produce a message to a Kafka topic

7685
05:36:04,900 --> 05:36:06,000
and then you can check

7686
05:36:06,000 --> 05:36:08,700
whether your consumer is
consuming that message or not.

7687
05:36:08,700 --> 05:36:11,049
This is basically used
for testing. Now,

7688
05:36:11,049 --> 05:36:13,400
let us quickly go ahead
and close this.

7689
05:36:15,700 --> 05:36:18,700
Now let us get back
to our slides. Now,

7690
05:36:18,700 --> 05:36:20,605
I have briefly covered Kafka

7691
05:36:20,605 --> 05:36:24,300
and the concepts of Kafka. So
here basically I'm giving

7692
05:36:24,300 --> 05:36:27,200
you a small brief idea
about what Kafka is

7693
05:36:27,200 --> 05:36:29,100
and how Kafka works. Now,

7694
05:36:29,100 --> 05:36:32,100
as we have understood why
we need messaging systems,

7695
05:36:32,100 --> 05:36:33,100
what Kafka is,

7696
05:36:33,100 --> 05:36:35,000
what the different
terminologies in Kafka are,

7697
05:36:35,000 --> 05:36:36,657
and how the Kafka architecture works,

7698
05:36:36,657 --> 05:36:39,513
and we have seen some
of the basic Kafka commands,

7699
05:36:39,513 --> 05:36:41,000
let us now understand

7700
05:36:41,000 --> 05:36:42,600
what Apache Spark is.

7701
05:36:42,800 --> 05:36:44,900
So basically Apache Spark

7702
05:36:44,900 --> 05:36:47,802
is an open-source cluster
computing framework

7703
05:36:47,802 --> 05:36:51,300
for near real-time processing.
Now Spark provides

7704
05:36:51,300 --> 05:36:54,205
an interface for programming
the entire cluster

7705
05:36:54,205 --> 05:36:56,047
with implicit data parallelism

7706
05:36:56,047 --> 05:36:59,300
and fault tolerance. We'll talk
about how Spark provides

7707
05:36:59,300 --> 05:37:02,900
fault tolerance, but talking
about implicit data parallelism,

7708
05:37:02,900 --> 05:37:06,600
that means you do not need
any special directives, operators,

7709
05:37:06,600 --> 05:37:09,000
or functions to enable
parallel execution.

7710
05:37:09,000 --> 05:37:12,600
Spark by default provides
data parallelism. Spark

7711
05:37:12,600 --> 05:37:15,628
is designed to cover
a wide range of workloads, such

7712
05:37:15,628 --> 05:37:16,919
as batch applications,

7713
05:37:16,919 --> 05:37:20,400
iterative algorithms, interactive
queries, machine learning

7714
05:37:20,400 --> 05:37:22,000
algorithms, and streaming.

7715
05:37:22,000 --> 05:37:24,174
So basically the main feature

7716
05:37:24,174 --> 05:37:27,500
of Spark is its
in-memory cluster computing,

7717
05:37:27,500 --> 05:37:30,900
that increases the processing
speed of the application.

7718
05:37:30,900 --> 05:37:34,763
So what does Spark do? Spark does
not store the data on disk;

7719
05:37:34,763 --> 05:37:36,950
instead, it
transforms the data

7720
05:37:36,950 --> 05:37:38,700
and keeps the data in memory,

7721
05:37:38,700 --> 05:37:39,616
so that

7722
05:37:39,616 --> 05:37:42,500
multiple operations can
be applied quickly over the data,

7723
05:37:42,500 --> 05:37:45,500
and only the final result
is stored on disk.

7724
05:37:45,500 --> 05:37:49,629
Now, on the other side, Spark can also do
batch processing up to a hundred times

7725
05:37:49,629 --> 05:37:51,108
faster than MapReduce.

7726
05:37:51,108 --> 05:37:54,400
And this is the reason why
Apache Spark is the

7727
05:37:54,400 --> 05:37:57,324
go-to tool for big data processing
in the industry.

7728
05:37:57,324 --> 05:38:00,000
Now, let's quickly move
ahead and understand

7729
05:38:00,000 --> 05:38:01,461
how Spark does this.

7730
05:38:01,600 --> 05:38:03,617
So the answer is RDD,

7731
05:38:03,617 --> 05:38:07,700
that is, resilient distributed
datasets. Now an RDD is

7732
05:38:07,700 --> 05:38:11,406
a read-only partitioned
collection of records, and you

7733
05:38:11,406 --> 05:38:14,897
can say it is the fundamental
data structure of Spark.

7734
05:38:14,897 --> 05:38:16,312
So basically, an RDD is

7735
05:38:16,312 --> 05:38:19,522
an immutable distributed
collection of objects.

7736
05:38:19,522 --> 05:38:21,709
So each data set
in rdd is divided

7737
05:38:21,709 --> 05:38:23,300
into logical partitions,

7738
05:38:23,300 --> 05:38:25,639
which may be computed
on different nodes

7739
05:38:25,639 --> 05:38:28,400
of the cluster. Now an RDD
can contain any type

7740
05:38:28,400 --> 05:38:30,800
of Python, Java, or Scala objects.

7741
05:38:30,800 --> 05:38:33,900
Now talking about
fault tolerance, an RDD

7742
05:38:33,900 --> 05:38:37,900
is a fault-tolerant collection
of elements that can be operated

7743
05:38:37,900 --> 05:38:39,000
on in parallel.

7744
05:38:39,000 --> 05:38:40,500
Now, how does an RDD do

7745
05:38:40,500 --> 05:38:43,380
that? If an RDD is lost,
it will automatically

7746
05:38:43,380 --> 05:38:45,609
be recomputed by using the original

7747
05:38:45,609 --> 05:38:49,300
transformations, and this is how Spark
provides fault tolerance.

7748
05:38:49,300 --> 05:38:51,255
So I hope that you
guys are clear

7749
05:38:51,255 --> 05:38:53,700
on how Spark
provides fault tolerance.

7750
05:38:54,132 --> 05:38:57,500
Now let's talk about
how we can create RDDs.

7751
05:38:57,500 --> 05:39:01,600
So there are two ways to create
RDDs: first is parallelizing

7752
05:39:01,600 --> 05:39:04,474
an existing collection
in your driver program,

7753
05:39:04,474 --> 05:39:06,200
or you can reference a dataset

7754
05:39:06,200 --> 05:39:09,300
in an external storage system,
such as a shared file system.

7755
05:39:09,300 --> 05:39:11,300
It can be HDFS, HBase,

7756
05:39:11,300 --> 05:39:15,200
or any other data source
offering a Hadoop InputFormat.

7757
05:39:15,200 --> 05:39:16,800
Now Spark makes use

7758
05:39:16,800 --> 05:39:20,200
of the concept of RDD to achieve
fast and efficient operations.

7759
05:39:20,200 --> 05:39:22,600
Now, let's quickly move ahead

7760
05:39:22,600 --> 05:39:27,200
and look at how an RDD works. So
first we create an RDD,

7761
05:39:27,200 --> 05:39:29,600
which you can create,
for example, by referring

7762
05:39:29,600 --> 05:39:31,800
to an external storage system.

7763
05:39:31,800 --> 05:39:35,400
And then once you create
an RDD you can go ahead

7764
05:39:35,400 --> 05:39:37,800
and you can apply
multiple transformations

7765
05:39:37,800 --> 05:39:38,800
over that RDD,

7766
05:39:39,100 --> 05:39:43,100
like we'll perform
filter, map, union, etc.

7767
05:39:43,100 --> 05:39:44,219
And then again,

7768
05:39:44,219 --> 05:39:48,400
it gives you a new RDD, or you
can say the transformed RDD,

7769
05:39:48,400 --> 05:39:51,500
and at last you apply
some action and get

7770
05:39:51,500 --> 05:39:55,100
the result. Now this action
can be count, first,

7771
05:39:55,100 --> 05:39:57,149
or collect, all those kinds

7772
05:39:57,149 --> 05:39:58,100
of functions.

7773
05:39:58,100 --> 05:40:01,151
So now this is a brief idea
about what an RDD is

7774
05:40:01,151 --> 05:40:02,400
and how an RDD works.
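
The create, transform, act workflow described above can be sketched in plain Python. This is a toy model rather than the Spark API: `MiniRDD` is a hypothetical class whose method names mirror the operations mentioned (filter, map, union, count, first, collect), and its transformations are lazy, so nothing runs until an action is called.

```python
from itertools import chain


class MiniRDD:
    """Toy partitioned collection with lazy transformations and eager actions."""

    def __init__(self, partitions):
        # Each partition is a zero-argument callable that yields its records.
        self._partitions = partitions

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        items = list(items)
        size = -(-len(items) // num_partitions)  # ceiling division
        chunks = [items[i * size:(i + 1) * size] for i in range(num_partitions)]
        return cls([lambda c=c: iter(c) for c in chunks])

    # Transformations: return a new MiniRDD, compute nothing yet.
    def map(self, fn):
        return MiniRDD([lambda p=p: (fn(x) for x in p()) for p in self._partitions])

    def filter(self, pred):
        return MiniRDD([lambda p=p: (x for x in p() if pred(x)) for p in self._partitions])

    def union(self, other):
        return MiniRDD(self._partitions + other._partitions)

    # Actions: walk the partitions and produce a result.
    def _records(self):
        return chain.from_iterable(p() for p in self._partitions)

    def collect(self):
        return list(self._records())

    def count(self):
        return sum(1 for _ in self._records())

    def first(self):
        return next(self._records())


nums = MiniRDD.parallelize(range(1, 7), num_partitions=3)   # [1, 2], [3, 4], [5, 6]
squares_of_evens = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
combined = squares_of_evens.union(MiniRDD.parallelize([100]))
print(combined.collect(), combined.count(), combined.first())
```

Each transformation returns a new object and leaves the original untouched, which mirrors the read-only nature of RDDs described earlier.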

7775
05:40:02,400 --> 05:40:04,570
So now let's quickly
move ahead and look

7776
05:40:04,570 --> 05:40:06,100
at the different workloads

7777
05:40:06,100 --> 05:40:08,200
that can be handled
by Apache Spark.

7778
05:40:08,200 --> 05:40:10,883
So we have interactive
streaming analytics.

7779
05:40:10,883 --> 05:40:12,800
Then we have machine learning.

7780
05:40:12,800 --> 05:40:14,158
We have data integration.

7781
05:40:14,158 --> 05:40:16,207
We have Spark
streaming and processing.

7782
05:40:16,207 --> 05:40:17,944
So let us talk about them one

7783
05:40:17,944 --> 05:40:20,400
by one. First is Spark
streaming and processing.

7784
05:40:20,400 --> 05:40:21,400
So now basically,

7785
05:40:21,400 --> 05:40:24,007
you know data arrives
at a steady rate,

7786
05:40:24,007 --> 05:40:27,000
or you can say
as continuous streams, right?

7787
05:40:27,000 --> 05:40:29,300
And then what you can do
you can again go ahead

7788
05:40:29,300 --> 05:40:30,829
and store the data set in disk

7789
05:40:30,829 --> 05:40:34,299
and then you can actually go
ahead and apply some processing

7790
05:40:34,299 --> 05:40:36,007
over it some analytics over it

7791
05:40:36,007 --> 05:40:38,000
and then get
some results out of it,

7792
05:40:38,000 --> 05:40:41,200
but this is not the scenario
with each and every case.

7793
05:40:41,200 --> 05:40:44,100
Let's take an example
of financial transactions

7794
05:40:44,100 --> 05:40:46,343
where you have to go
ahead and identify

7795
05:40:46,343 --> 05:40:48,931
and refuse potential
fraudulent transactions.

7796
05:40:48,931 --> 05:40:50,297
Now if you will go ahead

7797
05:40:50,297 --> 05:40:53,197
and store the data stream
and then you will go ahead

7798
05:40:53,197 --> 05:40:55,800
and apply some processing,
it would be too late

7799
05:40:55,800 --> 05:40:58,287
and someone would have got
away with the money.

7800
05:40:58,287 --> 05:41:00,386
So in that scenario,
what do you need to do?

7801
05:41:00,386 --> 05:41:03,183
So you need to quickly take
that input data stream.

7802
05:41:03,183 --> 05:41:05,700
You need to apply
some Transformations over it

7803
05:41:05,700 --> 05:41:08,300
and then you have
to take actions accordingly.

7804
05:41:08,300 --> 05:41:10,015
Like you can send
some notification

7805
05:41:10,015 --> 05:41:11,322
or you can actually reject

7806
05:41:11,322 --> 05:41:13,972
that fraudulent transaction
something like that.

7807
05:41:13,972 --> 05:41:15,200
And then you can go ahead

7808
05:41:15,200 --> 05:41:17,686
and if you want you
can store those results

7809
05:41:17,686 --> 05:41:19,700
or data set in some
of the database

7810
05:41:19,700 --> 05:41:21,700
or, you can say, some
file system.

7811
05:41:21,800 --> 05:41:24,000
So we have some scenarios.

7812
05:41:24,026 --> 05:41:27,873
where we have to actually
process the stream of data

7813
05:41:27,900 --> 05:41:29,300
and then we have to go ahead

7814
05:41:29,300 --> 05:41:30,358
and store the data

7815
05:41:30,358 --> 05:41:34,008
or perform some analysis on it
or take some necessary actions.

7816
05:41:34,008 --> 05:41:37,000
So this is where Spark
streaming comes into picture

7817
05:41:37,000 --> 05:41:38,575
and Spark is the best fit

7818
05:41:38,575 --> 05:41:42,000
for processing those continuous
input data streams.

7819
05:41:42,000 --> 05:41:45,500
Now moving to the next,
that is machine learning. Now,

7820
05:41:45,500 --> 05:41:46,314
as you know,

7821
05:41:46,314 --> 05:41:47,730
that first we create

7822
05:41:47,730 --> 05:41:51,182
a machine learning model
then we continuously feed

7823
05:41:51,182 --> 05:41:54,011
those incoming data
streams to the model.

7824
05:41:54,011 --> 05:41:56,700
And we get some
continuous output based

7825
05:41:56,700 --> 05:41:58,144
on the input values.

7826
05:41:58,144 --> 05:42:00,453
Now, reusing
intermediate results

7827
05:42:00,453 --> 05:42:04,300
across multiple computations
in multi-stage applications

7828
05:42:04,300 --> 05:42:07,600
basically incurs
substantial overhead due to

7829
05:42:07,600 --> 05:42:10,500
data replication, disk
I/O, and serialization,

7830
05:42:10,500 --> 05:42:12,200
which makes the system slow.

7831
05:42:12,200 --> 05:42:16,200
Now what does Spark do? Spark RDDs
store intermediate results

7832
05:42:16,200 --> 05:42:19,446
in distributed memory
instead of stable storage

7833
05:42:19,446 --> 05:42:21,200
and make the system faster.

7834
05:42:21,200 --> 05:42:24,800
So as we saw, in Spark RDDs
all the transformations

7835
05:42:24,800 --> 05:42:26,482
will be applied over there

7836
05:42:26,482 --> 05:42:29,200
and all the transformed
rdds will be stored

7837
05:42:29,200 --> 05:42:31,999
in the memory itself
so we can quickly go ahead

7838
05:42:31,999 --> 05:42:35,037
and apply some more
iterative algorithms over there

7839
05:42:35,037 --> 05:42:37,508
and it does not take
much time in functions

7840
05:42:37,508 --> 05:42:39,333
like data replication or disk

7841
05:42:39,333 --> 05:42:42,164
I/O, so all those overheads
will be reduced. Now

7842
05:42:42,164 --> 05:42:45,500
you might be wondering,
memory is always limited.

7843
05:42:45,500 --> 05:42:48,000
So what if the memory
runs out? So

7844
05:42:48,000 --> 05:42:50,600
if the distributed memory
is not sufficient

7845
05:42:50,600 --> 05:42:52,100
to store intermediate results,

7846
05:42:52,300 --> 05:42:54,300
then it will
store those results

7847
05:42:54,300 --> 05:42:55,100
on the disk.

7848
05:42:55,100 --> 05:42:58,000
So I hope that you guys are
clear on how Spark performs

7849
05:42:58,000 --> 05:43:00,000
these iterative machine
learning algorithms

7850
05:43:00,000 --> 05:43:01,500
and why Spark is fast.

7851
05:43:01,819 --> 05:43:04,280
Let's look at the next workload.

7852
05:43:04,400 --> 05:43:08,200
So the next workload is
interactive streaming analytics.

7853
05:43:08,200 --> 05:43:10,900
Now as we already discussed
about streaming data

7854
05:43:10,900 --> 05:43:15,300
so a user runs ad hoc queries
on the same subset of data,

7855
05:43:15,300 --> 05:43:19,127
and each query will do disk
I/O on the stable storage,

7856
05:43:19,127 --> 05:43:22,386
which can dominate
the application's execution time.

7857
05:43:22,386 --> 05:43:24,300
So, let me take the example

7858
05:43:24,300 --> 05:43:25,400
of a data scientist.

7859
05:43:25,400 --> 05:43:27,800
So basically you have
continuous streams of data

7860
05:43:27,800 --> 05:43:28,800
which are coming in.

7861
05:43:28,800 --> 05:43:30,650
So what would your data
scientist do?

7862
05:43:30,650 --> 05:43:32,900
Your data scientist
will either ask

7863
05:43:32,900 --> 05:43:34,274
some questions execute

7864
05:43:34,274 --> 05:43:37,208
some queries over the data
then view the result

7865
05:43:37,208 --> 05:43:40,563
and then he might alter
the initial question slightly

7866
05:43:40,563 --> 05:43:41,804
by seeing the output

7867
05:43:41,804 --> 05:43:44,332
or he might also drill
deeper into results

7868
05:43:44,332 --> 05:43:47,757
and execute some more queries
over the gathered result.

7869
05:43:47,757 --> 05:43:51,500
So there are multiple scenarios
in which your data scientist

7870
05:43:51,500 --> 05:43:54,265
would be running
some interactive queries.

7871
05:43:54,265 --> 05:43:57,569
on the streaming data.
Now, how does Spark help

7872
05:43:57,569 --> 05:44:00,200
in this interactive
streaming analytics?

7873
05:44:00,200 --> 05:44:04,453
Each transformed RDD
may be recomputed each time

7874
05:44:04,453 --> 05:44:06,838
you run an action on it, right?

7875
05:44:06,838 --> 05:44:10,692
But when you persist an RDD
in memory,

7876
05:44:10,692 --> 05:44:13,430
Spark will keep all
the elements around

7877
05:44:13,430 --> 05:44:15,800
on the cluster for faster access

7878
05:44:15,800 --> 05:44:18,296
and whenever you will execute
the query next time

7879
05:44:18,296 --> 05:44:19,077
over the data,

7880
05:44:19,077 --> 05:44:21,200
then the query will
be executed quickly

7881
05:44:21,200 --> 05:44:23,700
and it will give you
an instant result, right?
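
The difference between recomputing on every action and persisting in memory can be sketched in plain Python. This is a toy model, not Spark's API: the class `Dataset` and its `evaluations` counter are hypothetical, added only to make the number of recomputations visible.

```python
class Dataset:
    """Toy dataset that recomputes on every action unless persisted."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument callable producing the records
        self._persisted = False
        self._cache = None
        self.evaluations = 0       # how many times the pipeline actually ran

    def persist(self):
        self._persisted = True
        return self

    def _materialize(self):
        if self._persisted and self._cache is not None:
            return self._cache     # served from memory, no recomputation
        self.evaluations += 1
        data = list(self._compute())
        if self._persisted:
            self._cache = data
        return data

    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


words = ["spark", "kafka", "spark", "hdfs"]

uncached = Dataset(lambda: (w.upper() for w in words))
uncached.count()
uncached.collect()

cached = Dataset(lambda: (w.upper() for w in words)).persist()
cached.count()
cached.collect()

print(uncached.evaluations, cached.evaluations)
```

For an analyst running several ad hoc queries over the same data, the persisted version pays the computation cost once and answers later queries from memory.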

7882
05:44:24,100 --> 05:44:26,090
So I hope that you
guys are clear

7883
05:44:26,090 --> 05:44:29,200
how Spark helps in
interactive streaming analytics.

7884
05:44:29,400 --> 05:44:32,000
Now, let's talk
about data integration.

7885
05:44:32,000 --> 05:44:33,570
So basically as you know,

7886
05:44:33,570 --> 05:44:36,900
that in large organizations data
is basically produced

7887
05:44:36,900 --> 05:44:39,400
from different systems
across the business

7888
05:44:39,400 --> 05:44:42,000
and basically you
need a framework

7889
05:44:42,000 --> 05:44:45,800
which can actually integrate
different data sources.

7890
05:44:45,800 --> 05:44:46,900
So Spark is the one

7891
05:44:46,900 --> 05:44:49,382
which actually integrates
different data sources,

7892
05:44:49,382 --> 05:44:50,500
so you can go ahead

7893
05:44:50,500 --> 05:44:53,800
and you can take the data
from Kafka, Cassandra, Flume,

7894
05:44:53,800 --> 05:44:55,518
HBase, then Amazon S3.

7895
05:44:55,518 --> 05:44:59,300
Then you can perform some real
time analytics over there

7896
05:44:59,300 --> 05:45:02,000
or even say some near
real-time analytics over there.

7897
05:45:02,000 --> 05:45:04,250
You can apply some machine
learning algorithms

7898
05:45:04,250 --> 05:45:05,700
and then you can go ahead

7899
05:45:05,700 --> 05:45:08,500
and store the processed
result in Apache HBase,

7900
05:45:08,500 --> 05:45:10,600
then MySQL, HDFS.

7901
05:45:10,600 --> 05:45:12,100
It could be your Kafka.

7902
05:45:12,100 --> 05:45:15,500
So Spark basically gives
you multiple options

7903
05:45:15,500 --> 05:45:16,600
where you can go ahead

7904
05:45:16,600 --> 05:45:18,500
and pick the data
from and again,

7905
05:45:18,500 --> 05:45:21,200
you can go ahead
and write the data into. Now,

7906
05:45:21,200 --> 05:45:23,620
let's quickly move ahead
and we'll talk

7907
05:45:23,620 --> 05:45:27,013
about different Spark components.
So you can see here

7908
05:45:27,013 --> 05:45:28,500
I have a Spark Core engine.

7909
05:45:28,500 --> 05:45:30,376
So basically this
is the core engine,

7910
05:45:30,376 --> 05:45:32,200
and on top of this core engine

7911
05:45:32,200 --> 05:45:35,574
you have Spark SQL, Spark
Streaming, then MLlib,

7912
05:45:35,900 --> 05:45:38,100
then you have GraphX
and then you have SparkR.

7913
05:45:38,200 --> 05:45:41,087
Let's talk about them one
by one and we'll start

7914
05:45:41,087 --> 05:45:42,500
with the Spark Core engine.

7915
05:45:42,500 --> 05:45:45,200
So the Spark Core engine
is the base engine

7916
05:45:45,200 --> 05:45:46,800
for large-scale parallel

7917
05:45:46,800 --> 05:45:50,026
and distributed data processing.
Additional libraries,

7918
05:45:50,026 --> 05:45:52,200
which are built on top
of the core, allow

7919
05:45:52,200 --> 05:45:53,700
diverse workloads for

7920
05:45:53,700 --> 05:45:57,300
streaming, SQL, machine learning.
Then you can go ahead

7921
05:45:57,300 --> 05:45:59,300
and execute R on Spark,

7922
05:45:59,300 --> 05:46:01,731
or you can go ahead
and execute Python on Spark;

7923
05:46:01,731 --> 05:46:03,000
those kinds of workloads

7924
05:46:03,000 --> 05:46:04,700
you can easily go
ahead and execute.

7925
05:46:04,700 --> 05:46:07,800
So basically your Spark
Core engine is the one

7926
05:46:07,800 --> 05:46:10,040
that is managing all your memory,

7927
05:46:10,040 --> 05:46:13,084
then all your fault
recovery your scheduling

7928
05:46:13,084 --> 05:46:14,755
your Distributing of jobs

7929
05:46:14,755 --> 05:46:16,078
and monitoring jobs

7930
05:46:16,078 --> 05:46:19,700
on a cluster and interacting
with the storage system.

7931
05:46:19,700 --> 05:46:22,400
So in short we
can say the Spark

7932
05:46:22,400 --> 05:46:24,501
Core engine is the heart of Spark,

7933
05:46:24,501 --> 05:46:25,951
and on top of this all

7934
05:46:25,951 --> 05:46:28,389
of these libraries
are there. So first,

7935
05:46:28,389 --> 05:46:30,429
let's talk about
Spark Streaming.

7936
05:46:30,429 --> 05:46:33,088
So Spark Streaming is
the component of Spark

7937
05:46:33,088 --> 05:46:36,273
which is used to process
real-time streaming data

7938
05:46:36,273 --> 05:46:37,600
as we just discussed

7939
05:46:37,600 --> 05:46:41,061
and it is a useful addition
to the Spark Core API.

7940
05:46:41,200 --> 05:46:43,600
Now it enables high-throughput
and fault-tolerant

7941
05:46:43,600 --> 05:46:46,554
stream processing
for live data streams.

7942
05:46:46,554 --> 05:46:47,700
So you can go ahead

7943
05:46:47,700 --> 05:46:51,338
and you can perform all
the streaming data analytics

7944
05:46:51,338 --> 05:46:55,800
using Spark Streaming. Then
you have Spark SQL over here.

7945
05:46:55,800 --> 05:46:58,900
So basically Spark SQL is
a new module in Spark

7946
05:46:58,900 --> 05:47:02,200
which integrates relational
processing with Spark's functional

7947
05:47:02,200 --> 05:47:06,900
programming API, and it supports
querying data either via SQL

7948
05:47:06,900 --> 05:47:08,315
or HQL, that is, Hive

7949
05:47:08,315 --> 05:47:09,469
Query Language.

7950
05:47:09,500 --> 05:47:11,500
So basically for those of you

7951
05:47:11,500 --> 05:47:15,615
who are familiar with RDBMS,
Spark SQL is an easy transition

7952
05:47:15,615 --> 05:47:17,100
from your earlier tools,

7953
05:47:17,100 --> 05:47:19,511
where you can go ahead
and extend the boundaries

7954
05:47:19,511 --> 05:47:22,100
of traditional relational
data processing now

7955
05:47:22,100 --> 05:47:23,700
talking about GraphX.

7956
05:47:23,700 --> 05:47:24,900
So GraphX is

7957
05:47:24,900 --> 05:47:28,500
the Spark API for graphs
and graph-parallel computation.

7958
05:47:28,500 --> 05:47:30,800
It extends the spark rdd

7959
05:47:30,800 --> 05:47:34,309
with a resilient distributed
property graph. Talking

7960
05:47:34,309 --> 05:47:35,213
at high level.

7961
05:47:35,213 --> 05:47:38,700
basically, GraphX extends
the Spark RDD abstraction

7962
05:47:38,700 --> 05:47:41,758
by introducing the resilient
distributed property graph,

7963
05:47:41,758 --> 05:47:42,778
which is nothing

7964
05:47:42,778 --> 05:47:45,900
but a directed multigraph
with properties attached

7965
05:47:45,900 --> 05:47:49,700
to each vertex and edge.
Next we have SparkR. So

7966
05:47:49,700 --> 05:47:52,394
basically it provides you
packages for the R language,

7967
05:47:52,394 --> 05:47:54,100
and then you can go ahead and

7968
05:47:54,100 --> 05:47:55,399
leverage Spark's power

7969
05:47:55,399 --> 05:47:58,000
with the R shell. Next
you have Spark MLlib.

7970
05:47:58,000 --> 05:48:01,849
So MLlib basically stands
for machine learning library.

7971
05:48:01,849 --> 05:48:05,200
So Spark MLlib is used
to perform machine learning

7972
05:48:05,200 --> 05:48:06,500
in Apache spark.

7973
05:48:06,500 --> 05:48:08,773
Now many common machine learning

7974
05:48:08,773 --> 05:48:11,784
and statistical algorithms
have been implemented

7975
05:48:11,784 --> 05:48:13,700
and are shipped with MLlib,

7976
05:48:13,700 --> 05:48:16,935
which simplifies large scale
machine learning pipelines,

7977
05:48:16,935 --> 05:48:18,347
which basically includes

7978
05:48:18,347 --> 05:48:20,994
summary statistics
correlations classification

7979
05:48:20,994 --> 05:48:23,800
and regression collaborative
filtering techniques.

7980
05:48:23,800 --> 05:48:25,700
You have cluster analysis methods,

7981
05:48:25,700 --> 05:48:28,582
then you have dimensionality
reduction techniques.

7982
05:48:28,582 --> 05:48:31,400
You have feature extraction
and transformation functions.

7983
05:48:31,400 --> 05:48:33,700
Then you have
optimization algorithms,

7984
05:48:33,700 --> 05:48:35,900
It is basically an MLlib package,

7985
05:48:35,900 --> 05:48:39,000
or you can say a machine
learning package on top of Spark.

7986
05:48:39,000 --> 05:48:41,639
Then you also have
something called PySpark,

7987
05:48:41,639 --> 05:48:43,979
which is the Python package
for Spark. There,

7988
05:48:43,979 --> 05:48:46,800
You can go ahead
and leverage Python over Spark.

7989
05:48:46,800 --> 05:48:47,376
So I hope

7990
05:48:47,376 --> 05:48:50,900
that you guys are clear
with different spark components.

7991
05:48:51,100 --> 05:48:53,200
So before moving
to the Kafka Spark

7992
05:48:53,200 --> 05:48:54,524
Streaming demo:

7993
05:48:54,524 --> 05:48:58,075
So I have just given you
a brief intro to Apache spark.

7994
05:48:58,075 --> 05:49:01,100
If you want a detailed tutorial
on Apache spark

7995
05:49:01,100 --> 05:49:02,600
or different components

7996
05:49:02,600 --> 05:49:06,753
of Apache Spark, like Apache
Spark SQL, Spark DataFrames,

7997
05:49:06,800 --> 05:49:10,200
or Spark Streaming,
Spark GraphX, Spark MLlib,

7998
05:49:10,200 --> 05:49:13,200
you can go to Edureka's
YouTube channel again.

7999
05:49:13,200 --> 05:49:14,800
So now we are here guys.

8000
05:49:14,800 --> 05:49:18,252
I know that you guys are waiting
for this demo from a while.

8001
05:49:18,252 --> 05:49:21,900
So now let's go ahead and look
at the Kafka Spark Streaming demo.

8002
05:49:21,900 --> 05:49:23,700
So let me quickly go
ahead and open

8003
05:49:23,700 --> 05:49:28,000
my virtual machine
and I'll open a terminal.

8004
05:49:28,600 --> 05:49:30,658
So let me first check
all the daemons

8005
05:49:30,658 --> 05:49:32,400
that are running in my system.

8006
05:49:33,800 --> 05:49:35,341
So my zookeeper is running

8007
05:49:35,341 --> 05:49:37,753
name node is running
data node is running.

8008
05:49:37,753 --> 05:49:39,130
Then my ResourceManager

8009
05:49:39,130 --> 05:49:42,714
is running, all three Kafka
brokers are running, then

8010
05:49:42,714 --> 05:49:44,088
node manager is running

8011
05:49:44,088 --> 05:49:46,000
and JobHistoryServer is running.

8012
05:49:46,200 --> 05:49:49,200
So now I have to start
my Spark daemons.

8013
05:49:49,200 --> 05:49:51,900
So let me first go
to the spark home

8014
05:49:52,600 --> 05:49:54,600
and start the Spark daemons.

8015
05:49:54,600 --> 05:49:57,800
The command is
./sbin/start-all

8016
05:49:57,800 --> 05:49:58,900
.sh

8017
05:50:01,400 --> 05:50:03,400
So let me quickly go ahead

8018
05:50:03,400 --> 05:50:06,861
and execute sudo jps
to check my Spark daemons.

8019
05:50:08,500 --> 05:50:12,200
So you can see the Master
and Worker daemons are running.

8020
05:50:12,596 --> 05:50:14,903
So let me close this terminal.

8021
05:50:16,300 --> 05:50:18,700
Let me go to
the project directory.

8022
05:50:20,600 --> 05:50:22,808
So basically, I
have two projects.

8023
05:50:22,808 --> 05:50:25,376
This is the Kafka
transaction producer.

8024
05:50:25,376 --> 05:50:28,852
And the next one is the spark
streaming Kafka master.

8025
05:50:28,852 --> 05:50:31,327
So first we will
be producing messages

8026
05:50:31,327 --> 05:50:33,400
from Kafka transaction producer

8027
05:50:33,400 --> 05:50:36,200
and then we'll be
streaming those records

8028
05:50:36,200 --> 05:50:39,670
which is basically produced by
this producer using the spark

8029
05:50:39,670 --> 05:50:41,025
streaming Kafka master.

8030
05:50:41,025 --> 05:50:42,494
So first, let me take you

8031
05:50:42,494 --> 05:50:45,100
through this Kafka
transaction producer.

8032
05:50:45,100 --> 05:50:47,244
So this is
our pom.xml file.

8033
05:50:47,244 --> 05:50:49,004
Let me open it with gedit.

8034
05:50:49,004 --> 05:50:50,700
So basically this is a Maven

8035
05:50:50,700 --> 05:50:54,400
project, and I have used
a Spring Boot server.

8036
05:50:54,800 --> 05:50:57,071
So I have given Java version

8037
05:50:57,071 --> 05:51:00,456
as 8. You can see the
Kafka client over here

8038
05:51:00,500 --> 05:51:02,900
and the version of Kafka client,

8039
05:51:03,780 --> 05:51:07,719
then you can see I'm adding
Jackson Databind.

8040
05:51:08,800 --> 05:51:13,500
Then Gson, and then I
am packaging it as a WAR file,

8041
05:51:13,600 --> 05:51:15,500
that is web archive file.

8042
05:51:15,500 --> 05:51:20,000
And here I am again specifying
the Spring Boot Maven plugin,

8043
05:51:20,000 --> 05:51:21,300
which is to be downloaded.

8044
05:51:21,300 --> 05:51:23,258
So let me quickly go ahead

8045
05:51:23,258 --> 05:51:27,100
and close this and we'll go
to this Source directory

8046
05:51:27,100 --> 05:51:29,125
and then we'll go inside main.

8047
05:51:29,125 --> 05:51:32,972
So basically this is the file
that is, the SalesJan2009 file.

8048
05:51:32,972 --> 05:51:35,200
So let me show you
the file first.

8049
05:51:37,300 --> 05:51:38,860
So these are the records

8050
05:51:38,860 --> 05:51:41,200
which I'll be producing
to Kafka.

8051
05:51:41,200 --> 05:51:43,600
So the fields
are transaction date

8052
05:51:43,600 --> 05:51:45,500
then product, price, payment

8053
05:51:45,500 --> 05:51:49,767
type, then name, city, state,
country, account created,

8054
05:51:49,800 --> 05:51:51,646
then last login, latitude,

8055
05:51:51,646 --> 05:51:52,846
and longitude.

8056
05:51:52,846 --> 05:51:57,400
So let me close this file
and then application.yml

8057
05:51:57,400 --> 05:51:59,778
is the main properties file.

8058
05:51:59,900 --> 05:52:02,654
So in this
application.yml I am specifying

8059
05:52:02,654 --> 05:52:04,000
the bootstrap server,

8060
05:52:04,000 --> 05:52:07,900
which is localhost:9092.
Then I am specifying the brokers,

8061
05:52:07,900 --> 05:52:11,500
which again reside
on localhost:9092. So here

8062
05:52:11,500 --> 05:52:16,200
I have specified the broker list.
Now, next I have the product topic.

8063
05:52:16,200 --> 05:52:19,000
So the topic of the
product is transaction.

8064
05:52:19,000 --> 05:52:21,230
Then the partition count is 1

8065
05:52:21,500 --> 05:52:25,800
So basically, the acks
config controls the criteria

8066
05:52:25,800 --> 05:52:29,100
under which requests
are considered complete

8067
05:52:29,100 --> 05:52:32,900
and the "all" setting we
have specified will result

8068
05:52:32,900 --> 05:52:35,828
in blocking on the full
commit of the record.

8069
05:52:35,828 --> 05:52:37,225
It is the slowest but

8070
05:52:37,225 --> 05:52:40,900
the most durable setting.
Now, talking about retries:

8071
05:52:40,900 --> 05:52:44,600
it will retry thrice.
Then we have the minimum pool size

8072
05:52:44,600 --> 05:52:46,587
and we have maximum pool size,

8073
05:52:46,587 --> 05:52:49,700
which is basically
for implementing Java threads

8074
05:52:49,700 --> 05:52:52,000
and at last we
have the file path.

8075
05:52:52,000 --> 05:52:53,900
So this is the path of the file,

8076
05:52:53,900 --> 05:52:57,900
which I have shown you just now
so messages will be consumed

8077
05:52:57,900 --> 05:52:58,800
from this file.
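To make those settings concrete, here is a minimal sketch of how they might be collected into a java.util.Properties object. The class and method names are hypothetical, not the project's code; only the values (localhost:9092, acks set to all, three retries) come from the walkthrough, and the keys are the standard Kafka producer config names.
```java
import java.util.Properties;
// Hypothetical helper for illustration; not taken from the demo project.
class ProducerSettingsSketch {
    static Properties build() {
        Properties props = new Properties();
        // Broker address as given in application.yml in the walkthrough.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // "all" waits for the full commit of the record: slowest, most durable.
        props.setProperty("acks", "all");
        // Retry a failed send up to three times.
        props.setProperty("retries", "3");
        return props;
    }
    public static void main(String[] args) {
        System.out.println(build());
    }
}
```
An object like this is what would typically be handed to a Kafka producer together with the key and value serializers.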

8078
05:52:58,800 --> 05:53:02,600
Let me quickly close this file
and we'll look at application

8079
05:53:02,600 --> 05:53:06,792
.properties. So here we
have specified the properties

8080
05:53:06,792 --> 05:53:08,600
for the Spring Boot server.

8081
05:53:08,700 --> 05:53:10,877
So we have server context path.

8082
05:53:10,877 --> 05:53:12,185
That is /edureka.

8083
05:53:12,185 --> 05:53:14,607
Then we have
spring application name

8084
05:53:14,607 --> 05:53:16,301
that is Kafka producer.

8085
05:53:16,301 --> 05:53:17,700
We have server Port

8086
05:53:17,700 --> 05:53:22,200
that is 9988, and
the spring events timeout is 20.

8087
05:53:22,200 --> 05:53:24,430
So let me close this as well.

8088
05:53:24,430 --> 05:53:25,530
Let's go back.

8089
05:53:25,800 --> 05:53:29,500
Let's go inside java, com,
edureka, kafka.

8090
05:53:29,700 --> 05:53:33,400
So we'll explore
the important files one by one.

8091
05:53:33,400 --> 05:53:36,800
So let me first take you
through this DTO directory.

8092
05:53:36,900 --> 05:53:39,617
And over here,
we have Transaction.java.

8093
05:53:39,617 --> 05:53:42,253
So basically here we
are storing the model.

8094
05:53:42,253 --> 05:53:45,871
So basically you can see these
are the fields from the file,

8095
05:53:45,871 --> 05:53:47,372
which I have shown you.

8096
05:53:47,372 --> 05:53:49,200
So we have transaction date.

8097
05:53:49,200 --> 05:53:53,600
We have product, price, payment
type, name, city, state, country,

8098
05:53:53,600 --> 05:53:57,700
and so on so we have created
variable for each field.

8099
05:53:57,700 --> 05:54:01,101
So what we are doing we
are basically creating a getter

8100
05:54:01,101 --> 05:54:03,766
and Setter function for
all these variables.

8101
05:54:03,766 --> 05:54:05,702
So we have get transaction ID,

8102
05:54:05,702 --> 05:54:08,800
which will basically
return the transaction ID. Then

8103
05:54:08,800 --> 05:54:10,600
we have set transaction ID,

8104
05:54:10,600 --> 05:54:13,300
which will basically
set the transaction ID.

8105
05:54:13,300 --> 05:54:13,809
Similarly.

8106
05:54:13,809 --> 05:54:17,036
We have get transaction date for
getting the transaction date.

8107
05:54:17,036 --> 05:54:19,100
Then we have set
transaction date and it

8108
05:54:19,100 --> 05:54:21,900
will set the transaction date
using this variable.

8109
05:54:21,900 --> 05:54:25,532
Then we have get product,
set product, get price, set price,

8110
05:54:25,532 --> 05:54:26,700
and all the getter

8111
05:54:26,700 --> 05:54:29,900
and Setter methods
for each of the variables.

8112
05:54:32,000 --> 05:54:34,000
This is the Constructor.

8113
05:54:34,100 --> 05:54:35,615
So here we are taking

8114
05:54:35,615 --> 05:54:39,513
all the parameters like
transaction date product price.

8115
05:54:39,513 --> 05:54:42,295
And then we are setting
the value of each

8116
05:54:42,295 --> 05:54:44,800
of the variables
using the this keyword.

8117
05:54:44,800 --> 05:54:48,295
So we are setting the value for
transaction date product price

8118
05:54:48,295 --> 05:54:51,500
payment and all of the fields
that are present over there.

8119
05:54:51,515 --> 05:54:51,900
Next.

8120
05:54:51,900 --> 05:54:55,053
We are also creating
a default Constructor

8121
05:54:55,200 --> 05:54:56,616
and then over here.

8122
05:54:56,616 --> 05:54:59,300
We are overriding
the tostring method

8123
05:54:59,300 --> 05:55:01,600
and what we are doing
we are basically

8124
05:55:02,400 --> 05:55:04,500
The transaction details

8125
05:55:04,500 --> 05:55:06,600
and we are
returning transaction date

8126
05:55:06,600 --> 05:55:09,100
and then the value
of transaction date product

8127
05:55:09,100 --> 05:55:12,300
then the value of product, price,
then the value of price,

8128
05:55:12,300 --> 05:55:14,900
and so on for all the fields.

8129
05:55:15,300 --> 05:55:18,800
So basically this is the model
of the transaction

8130
05:55:18,800 --> 05:55:20,000
so we can go ahead

8131
05:55:20,000 --> 05:55:22,529
and we can create object
of this transaction

8132
05:55:22,529 --> 05:55:24,400
and then we can easily go ahead

8133
05:55:24,400 --> 05:55:27,700
and send the transaction
object as the value.

8134
05:55:27,700 --> 05:55:29,900
So this is the main
reason of creating

8135
05:55:29,900 --> 05:55:31,588
this transaction model. Let

8136
05:55:31,588 --> 05:55:34,000
me quickly go ahead
and close this file.

8137
05:55:34,000 --> 05:55:38,400
Let's go back and let's first
take a look at this config.

8138
05:55:38,615 --> 05:55:41,384
So this is
KafkaProperties.java.

8139
05:55:41,500 --> 05:55:43,202
So what we did again

8140
05:55:43,202 --> 05:55:46,894
as I have shown you
the application.yml file.

8141
05:55:46,942 --> 05:55:48,500
So we have taken all

8142
05:55:48,500 --> 05:55:51,500
the parameters that we
have specified over there.

8143
05:55:51,600 --> 05:55:54,600
That is your bootstrap,
product topic, partition count,

8144
05:55:54,600 --> 05:55:57,700
then brokers, file name,
and thread count.

8145
05:55:57,700 --> 05:55:59,322
So all these properties

8146
05:55:59,322 --> 05:56:02,367
then you have file path.
For all of these,

8147
05:56:02,367 --> 05:56:04,300
we have taken we have created

8148
05:56:04,300 --> 05:56:07,100
a variable and then
what we are doing again,

8149
05:56:07,100 --> 05:56:08,700
we are doing the same thing

8150
05:56:08,700 --> 05:56:11,039
as we did with
our transaction model.

8151
05:56:11,039 --> 05:56:12,600
We are creating a getter

8152
05:56:12,600 --> 05:56:15,247
and Setter method for each
of these variables.

8153
05:56:15,247 --> 05:56:17,305
So you can see we
have get file path

8154
05:56:17,305 --> 05:56:19,300
and we are returning
the file path.

8155
05:56:19,300 --> 05:56:20,924
Then we have set file path

8156
05:56:20,924 --> 05:56:24,300
where we are setting the file
path using this operator.

8157
05:56:24,300 --> 05:56:24,800
Similarly.

8158
05:56:24,800 --> 05:56:26,600
We have get product topic,

8159
05:56:26,600 --> 05:56:29,567
set product topic. Then we
have a getter and setter

8160
05:56:29,567 --> 05:56:30,400
for thread count.

8161
05:56:30,400 --> 05:56:31,700
We have a getter and setter

8162
05:56:31,700 --> 05:56:36,000
for bootstrap and all
those properties. Now,

8163
05:56:36,100 --> 05:56:37,522
we can again go ahead

8164
05:56:37,522 --> 05:56:40,300
and call this Kafka
properties class anywhere,

8165
05:56:40,300 --> 05:56:41,400
and then we can easily

8166
05:56:41,400 --> 05:56:44,000
extract those values
using getter methods.

8167
05:56:44,100 --> 05:56:48,400
So let me quickly close
this file and I'll take you

8168
05:56:48,400 --> 05:56:50,500
to the configurations.

8169
05:56:50,900 --> 05:56:52,100
So in this configuration

8170
05:56:52,100 --> 05:56:54,700
what we are doing we
are creating the object

8171
05:56:54,700 --> 05:56:56,700
of Kafka properties
as you can see,

8172
05:56:57,000 --> 05:56:59,800
so what we are doing then we
are again creating a Properties

8173
05:56:59,800 --> 05:57:02,600
object and then we
are setting the properties

8174
05:57:02,700 --> 05:57:03,800
so you can see

8175
05:57:03,800 --> 05:57:06,800
that we are setting
the bootstrap server config

8176
05:57:06,800 --> 05:57:08,400
and then we are retrieving

8177
05:57:08,400 --> 05:57:11,900
the value using the Kafka
properties object.

8178
05:57:11,900 --> 05:57:14,300
And this is the get
bootstrap server function.

8179
05:57:14,300 --> 05:57:17,500
Then you can see we are setting
the acknowledgement config

8180
05:57:17,500 --> 05:57:18,400
and we are getting

8181
05:57:18,400 --> 05:57:22,100
the acknowledgement from this
get acknowledgement function.

8182
05:57:22,100 --> 05:57:24,900
And then we are using
this get retries method.

8183
05:57:24,900 --> 05:57:27,300
So from all these
Kafka properties object.

8184
05:57:27,300 --> 05:57:29,000
We are calling
those getter methods

8185
05:57:29,000 --> 05:57:30,700
and retrieving those values

8186
05:57:30,700 --> 05:57:34,100
and setting those values
in this property object.

8187
05:57:34,100 --> 05:57:36,900
Then we have the partitioner class.

8188
05:57:37,000 --> 05:57:40,294
So we are basically implementing
this default partitioner

8189
05:57:40,294 --> 05:57:41,400
which is present in

8190
05:57:41,400 --> 05:57:45,700
the org.apache.kafka.clients.producer.internals
package.

8191
05:57:45,700 --> 05:57:48,600
Then we are creating
a producer over here

8192
05:57:48,600 --> 05:57:50,756
and we are passing this props

8193
05:57:50,756 --> 05:57:54,400
object which will set
the properties so over here.

8194
05:57:54,400 --> 05:57:56,684
We are passing
the key serializer,

8195
05:57:56,684 --> 05:57:58,900
which is the
StringSerializer.

8196
05:57:58,900 --> 05:58:00,100
And then this is

8197
05:58:00,100 --> 05:58:04,400
the value serializer in which
we are creating a new custom

8198
05:58:04,400 --> 05:58:07,500
JSON serializer, and then
we are passing transaction

8199
05:58:07,500 --> 05:58:10,400
over here and then it
will return the producer

8200
05:58:10,500 --> 05:58:13,735
and then we are implementing
threads. We are again getting

8201
05:58:13,735 --> 05:58:15,200
the minimum pool size

8202
05:58:15,200 --> 05:58:17,700
from Kafka properties and get
maximum pool size

8203
05:58:17,700 --> 05:58:18,700
from Kafka properties.

8204
05:58:18,700 --> 05:58:19,600
So over here,

8205
05:58:19,600 --> 05:58:22,000
We are implementing
Java threads.
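As a rough idea of how a minimum and maximum pool size can drive Java threads, here is a small sketch using the JDK's ThreadPoolExecutor. This is an assumption for illustration: the walkthrough does not show which executor class the project uses, and the class name, the 60-second keep-alive and the unbounded queue are made up for the example.
```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
// Hypothetical sketch; the demo project may wire its pool differently (for example through Spring).
class PoolSketch {
    static ThreadPoolExecutor build(int minPoolSize, int maxPoolSize) {
        // The core size is the minimum number of threads kept alive; the maximum caps growth.
        return new ThreadPoolExecutor(minPoolSize, maxPoolSize, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    }
    public static void main(String[] args) {
        ThreadPoolExecutor pool = build(2, 4);
        pool.execute(() -> System.out.println("dispatching on " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```
Note that with an unbounded queue the pool only grows beyond the core size if the queue is bounded instead, so the maximum matters mostly when a bounded queue is chosen.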

8206
05:58:22,000 --> 05:58:25,534
Let me quickly close this Kafka
producer configuration,

8207
05:58:25,534 --> 05:58:28,200
where we are configuring
our Kafka producer.

8208
05:58:28,461 --> 05:58:29,538
Let's go back.

8209
05:58:30,400 --> 05:58:32,800
Let's quickly go to this API

8210
05:58:32,946 --> 05:58:36,253
which has the event producer
API Java file.

8211
05:58:36,300 --> 05:58:40,130
So here we are basically
creating an event producer API

8212
05:58:40,130 --> 05:58:42,400
which has this
dispatch function.

8213
05:58:42,400 --> 05:58:46,900
So we'll use this dispatch
function to send the records.

8214
05:58:47,180 --> 05:58:49,719
So let me quickly
close this file.

8215
05:58:50,061 --> 05:58:51,138
Let's go back.

8216
05:58:51,300 --> 05:58:53,475
We have already seen this config

8217
05:58:53,475 --> 05:58:54,700
and configurations

8218
05:58:54,700 --> 05:58:57,100
in which we are basically
retrieving those values

8219
05:58:57,100 --> 05:58:58,984
from application dot yml file

8220
05:58:58,984 --> 05:59:02,300
and then we are Setting
the producer configurations,

8221
05:59:02,300 --> 05:59:04,000
then we have constants.

8222
05:59:04,000 --> 05:59:07,100
So in KafkaConstants.java,

8223
05:59:07,200 --> 05:59:09,900
we have created this Kafka
constant interface

8224
05:59:09,900 --> 05:59:11,393
where we have specified

8225
05:59:11,393 --> 05:59:14,925
the batch size, account limit,
checksum limit, then read

8226
05:59:14,925 --> 05:59:17,494
batch size, minimum
balance, maximum balance,

8227
05:59:17,494 --> 05:59:19,500
minimum account, maximum account.

8228
05:59:19,500 --> 05:59:22,604
Then we are also implementing
a date-time formatter.

8229
05:59:22,604 --> 05:59:25,643
So we are specifying all
the constants over here.

8230
05:59:25,643 --> 05:59:27,100
Let me close this file.

8231
05:59:27,100 --> 05:59:31,300
Let's go back. Then there is
monitoring, so we will not look

8232
05:59:31,300 --> 05:59:32,506
at these two files,

8233
05:59:32,506 --> 05:59:35,300
but let me tell you what
these two files do.

8234
05:59:35,300 --> 05:59:39,400
So these two files are
basically to record the metrics

8235
05:59:39,400 --> 05:59:42,000
of your Kafka like time in which

8236
05:59:42,000 --> 05:59:44,889
your thousand records have
been produced in Kafka, or

8237
05:59:44,889 --> 05:59:45,781
you can say the time

8238
05:59:45,781 --> 05:59:48,400
in which records
are getting published to Kafka.

8239
05:59:48,400 --> 05:59:51,936
It will be monitored and then
you can record those stats.

8240
05:59:51,936 --> 05:59:53,292
So basically it helps

8241
05:59:53,292 --> 05:59:57,100
in optimizing the performance
of your Kafka producer, right?

8242
05:59:57,100 --> 05:59:59,863
You can actually know
how to tune it,

8243
05:59:59,863 --> 06:00:03,000
how to adjust
those configuration factors

8244
06:00:03,000 --> 06:00:05,041
and then you can
see the difference

8245
06:00:05,041 --> 06:00:07,159
or you can actually
monitor the stats

8246
06:00:07,159 --> 06:00:08,259
and then understand

8247
06:00:08,259 --> 06:00:11,612
how you can actually make
your producer more efficient.

8248
06:00:11,612 --> 06:00:13,039
So these are basically

8249
06:00:13,039 --> 06:00:16,800
for those factors but let's
not worry about this right now.

8250
06:00:16,900 --> 06:00:18,600
Let's go back next.

8251
06:00:18,600 --> 06:00:21,500
Let me quickly take you
through this file utility.

8252
06:00:21,500 --> 06:00:24,000
So you have
FileUtility.java.

8253
06:00:24,000 --> 06:00:26,600
So basically what we
are doing over here,

8254
06:00:26,600 --> 06:00:28,550
we are reading each record

8255
06:00:28,550 --> 06:00:32,200
from the file using
BufferedReader. So over here,

8256
06:00:32,200 --> 06:00:36,900
you can see we have this list
and then we have BufferedReader.

8257
06:00:36,900 --> 06:00:38,700
Then we have FileReader.

8258
06:00:38,700 --> 06:00:41,000
So first we are reading the file

8259
06:00:41,000 --> 06:00:44,105
and then we are trying
to split each of the fields

8260
06:00:44,105 --> 06:00:45,500
present in the record.

8261
06:00:45,500 --> 06:00:49,500
And then we are setting the
value of those fields over here.

8262
06:00:49,700 --> 06:00:52,407
Then we are specifying
some of the exceptions

8263
06:00:52,407 --> 06:00:54,900
that may occur like
number format exception

8264
06:00:54,900 --> 06:00:57,500
or parse exception; all
those kinds of exceptions

8265
06:00:57,500 --> 06:01:00,900
we have specified over here
and then we are closing this.

8266
06:01:00,900 --> 06:01:01,959
so in this file.

8267
06:01:01,959 --> 06:01:04,746
We are basically
reading the records now.
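A minimal sketch of that read-and-split step could look like the following. It is not the project's FileUtility: the class name, the in-memory sample lines and the plain comma split are assumptions, and a real CSV with quoted commas would need a proper parser.
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
// Hypothetical sketch of reading records line by line and splitting them into fields.
class FileUtilitySketch {
    static List<String[]> readRecords(BufferedReader reader) throws IOException {
        List<String[]> records = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) continue; // skip blank lines
            records.add(line.split(",", -1)); // naive split; keeps trailing empty fields
        }
        return records;
    }
    public static void main(String[] args) throws IOException {
        // Made-up sample rows standing in for the sales file; a FileReader would be used for a real path.
        String sample = "2009-01-02,ProductA,100,Card\n2009-01-03,ProductB,200,Cash\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            for (String[] fields : readRecords(reader)) {
                System.out.println(fields.length + " fields, product=" + fields[1]);
            }
        }
    }
}
```
Each String array would then be mapped onto the fields of the transaction model, with number and date parsing wrapped in the exception handling mentioned above.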

8268
06:01:04,746 --> 06:01:06,000
Let me close this.

8269
06:01:06,000 --> 06:01:07,100
Let's go back.

8270
06:01:07,500 --> 06:01:07,766
Now.

8271
06:01:07,766 --> 06:01:10,500
Let's take a quick look
at the serializer.

8272
06:01:10,500 --> 06:01:13,100
So this is the custom
JSON serializer.

8273
06:01:13,500 --> 06:01:15,100
So in serializer,

8274
06:01:15,100 --> 06:01:18,000
we have created
a custom JSON serializer.

8275
06:01:18,000 --> 06:01:22,023
Now, this is basically
to write the values as bytes.

8276
06:01:22,100 --> 06:01:26,082
So the data which you will be
passing will be written in bytes

8277
06:01:26,082 --> 06:01:27,197
because as we know

8278
06:01:27,197 --> 06:01:29,800
that data is sent to Kafka
in the form of bytes.

8279
06:01:29,800 --> 06:01:32,000
And this is the reason
why we have created

8280
06:01:32,000 --> 06:01:33,700
this custom JSON serializer.
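To show the idea of turning an object into bytes, here is a small dependency-free sketch. It only mirrors the shape of Kafka's Serializer serialize(topic, data) method; the class, the two-field stand-in model and the hand-built JSON string are assumptions, and the real project most likely relies on a JSON library such as Jackson instead.
```java
import java.nio.charset.StandardCharsets;
// Hypothetical sketch: object in, UTF-8 JSON bytes out.
class JsonBytesSketch {
    // Two-field stand-in for the much larger Transaction model.
    static final class Txn {
        final String product;
        final double price;
        Txn(String product, double price) { this.product = product; this.price = price; }
    }
    static byte[] serialize(String topic, Txn data) {
        if (data == null) return null; // nothing to send for a null value
        // Escape backslashes and quotes only; control characters are ignored in this simplified sketch.
        String product = data.product.replace("\\", "\\\\").replace("\"", "\\\"");
        String json = "{\"product\":\"" + product + "\",\"price\":" + data.price + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }
    public static void main(String[] args) {
        byte[] bytes = serialize("transaction", new Txn("ProductA", 1200.0));
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```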

8281
06:01:33,930 --> 06:01:37,469
So now let me quickly close
this let's go back.

8282
06:01:37,700 --> 06:01:41,800
This file is basically for
my spring boot web application.

8283
06:01:41,900 --> 06:01:44,200
So let's not get into this.

8284
06:01:44,300 --> 06:01:47,100
Let's look at the events
thread Java file.

8285
06:01:47,865 --> 06:01:51,634
So basically over here we
have event producer API.

8286
06:01:52,300 --> 06:01:57,100
So now we are trying to dispatch
those events and to show you

8287
06:01:57,100 --> 06:01:58,988
how dispatch function works.

8288
06:01:58,988 --> 06:02:00,000
Let me go back.

8289
06:02:00,000 --> 06:02:01,691
Let me open services

8290
06:02:01,700 --> 06:02:05,000
and event producer
Impl, that is, the implementation.

8291
06:02:05,000 --> 06:02:08,100
So let me show you
how this dispatch works.

8292
06:02:08,100 --> 06:02:10,400
So basically over here
what we are doing first.

8293
06:02:10,400 --> 06:02:11,576
We are initializing.

8294
06:02:11,576 --> 06:02:13,047
So using the file utility.

8295
06:02:13,047 --> 06:02:16,000
we are basically reading
the file. To read the file,

8296
06:02:16,000 --> 06:02:19,356
We are getting the path using
this Kafka properties object

8297
06:02:19,356 --> 06:02:22,300
and we are calling
this getter method of file path.

8298
06:02:22,300 --> 06:02:24,900
Then what we are doing
we are basically taking

8299
06:02:24,900 --> 06:02:25,900
the product list

8300
06:02:25,900 --> 06:02:28,700
and then we are trying
to dispatch it so

8301
06:02:28,700 --> 06:02:32,800
in dispatch we are basically
using the Kafka producer

8302
06:02:33,600 --> 06:02:37,000
and then we are creating the
object of the producer record.

8303
06:02:37,000 --> 06:02:41,594
Then we are using the get topic
from the Kafka properties.

8304
06:02:41,594 --> 06:02:44,004
We are getting
this transaction ID

8305
06:02:44,004 --> 06:02:45,459
from the transaction

8306
06:02:45,459 --> 06:02:49,540
and then we are using event
producer send to send the data.

8307
06:02:49,540 --> 06:02:51,300
And finally we are trying

8308
06:02:51,300 --> 06:02:54,827
to monitor this but let's
not worry about the monitoring

8309
06:02:54,827 --> 06:02:57,200
and cache, the monitoring
and stats part,

8310
06:02:57,200 --> 06:02:59,661
so we can ignore this part. Next,

8311
06:02:59,800 --> 06:03:03,700
Let's quickly go back
and look at the last file

8312
06:03:03,700 --> 06:03:05,100
which is producer.

8313
06:03:05,600 --> 06:03:07,835
So let me show you
this event producer.

8314
06:03:07,835 --> 06:03:09,300
So what we are doing here,

8315
06:03:09,300 --> 06:03:11,500
we are actually
creating a logger.

8316
06:03:11,900 --> 06:03:13,500
So in this on completion method,

8317
06:03:13,500 --> 06:03:16,300
we are basically passing
the record metadata.

8318
06:03:16,300 --> 06:03:20,838
And if the exception is
not null then it will basically

8319
06:03:20,838 --> 06:03:25,200
log an error saying this,
along with the record metadata. Else,

8320
06:03:25,400 --> 06:03:29,700
it will give you the sent
message with topic, partition,

8321
06:03:29,700 --> 06:03:32,300
offset, and then
the record metadata

8322
06:03:32,300 --> 06:03:34,564
and topic and then it will give

8323
06:03:34,564 --> 06:03:38,800
you all the details regarding
topic partitions and offsets.
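The decision made in that completion callback can be sketched as a plain method. In real Kafka code this logic would sit inside Callback.onCompletion(RecordMetadata, Exception); the class name, the method signature and the message wording below are assumptions for illustration, with the metadata passed as simple values so the sketch runs without the Kafka library.
```java
// Hypothetical sketch of the "exception or metadata" branch in a send callback.
class CompletionLogSketch {
    static String describe(String topic, int partition, long offset, Exception exception) {
        if (exception != null) {
            // The send failed: report the error instead of the record position.
            return "Error sending record to topic " + topic + ": " + exception.getMessage();
        }
        // The send succeeded: report where the record landed.
        return "Sent message to topic " + topic + ", partition " + partition + ", offset " + offset;
    }
    public static void main(String[] args) {
        System.out.println(describe("transaction", 0, 42L, null));
        System.out.println(describe("transaction", 0, -1L, new RuntimeException("broker unavailable")));
    }
}
```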

8324
06:03:38,800 --> 06:03:40,888
So I hope that you
guys have understood

8325
06:03:40,888 --> 06:03:44,110
how this Kafka producer
is working. Now is the time we

8326
06:03:44,110 --> 06:03:47,169
need to go ahead and we need
to quickly execute this.

8327
06:03:47,169 --> 06:03:49,200
So let me open
a terminal over here.

8328
06:03:49,500 --> 06:03:51,653
Now, to first build this project,

8329
06:03:51,653 --> 06:03:54,423
We need to execute
mvn clean install.

8330
06:03:54,900 --> 06:03:56,800
This will install
all the dependencies.

8331
06:04:01,600 --> 06:04:04,100
So as you can see
our build is successful.

8332
06:04:04,100 --> 06:04:08,111
So let me minimize this and
this target directory is created

8333
06:04:08,111 --> 06:04:10,394
after you build
a Maven project.

8334
06:04:10,394 --> 06:04:11,778
So let me quickly go

8335
06:04:11,778 --> 06:04:16,000
inside this target directory and
this is the ROOT.war file,

8336
06:04:16,000 --> 06:04:18,300
that is, the ROOT
web archive file,

8337
06:04:18,300 --> 06:04:19,897
which we need to execute.

8338
06:04:19,897 --> 06:04:22,900
So let's quickly go ahead
and execute this file.

8339
06:04:23,100 --> 06:04:24,755
But before this to verify

8340
06:04:24,755 --> 06:04:27,800
whether the data
is getting produced in our Kafka

8341
06:04:27,800 --> 06:04:29,900
topics, so for testing,

8342
06:04:29,900 --> 06:04:33,300
as I already told you
We need to go ahead

8343
06:04:33,300 --> 06:04:36,200
and we need to open
a console consumer

8344
06:04:36,500 --> 06:04:37,500
so that we can check

8345
06:04:37,500 --> 06:04:40,200
that whether data
is getting published or not.

8346
06:04:42,400 --> 06:04:45,100
So let me quickly minimize this.

8347
06:04:48,300 --> 06:04:52,700
So let's quickly go to
Kafka directory and the command

8348
06:04:52,700 --> 06:04:59,300
is ./bin/kafka-console-consumer.sh
and then

8349
06:04:59,300 --> 06:05:01,500
--bootstrap-server

8350
06:05:14,800 --> 06:05:21,964
9092. Okay,
let me check the topic.

8351
06:05:21,964 --> 06:05:23,271
What's the topic?

8352
06:05:24,000 --> 06:05:27,000
Let's go to our
application dot yml file.

8353
06:05:27,000 --> 06:05:31,000
So the topic
name is transaction.

8354
06:05:31,000 --> 06:05:35,100
Let me quickly minimize
this, specify the topic name,

8355
06:05:35,100 --> 06:05:36,500
and I'll hit enter.

8356
06:05:36,500 --> 06:05:41,300
So now let me place
this console aside.

8357
06:05:41,300 --> 06:05:45,900
And now let's quickly go ahead
and execute our project.

8358
06:05:45,900 --> 06:05:49,400
So for that
the command is java

8359
06:05:49,400 --> 06:05:52,938
-jar, and then we'll provide
the path of the file

8360
06:05:52,938 --> 06:05:54,100
that is inside

8361
06:05:54,300 --> 06:05:59,700
target, and the file is
root.war, and here we go.

8362
06:06:18,100 --> 06:06:20,955
So now you can see
in the console consumer.

8363
06:06:20,955 --> 06:06:23,200
The records are
getting published.

8364
06:06:23,200 --> 06:06:23,700
Right?

8365
06:06:24,000 --> 06:06:25,903
So there are multiple records

8366
06:06:25,903 --> 06:06:29,118
which have been published
in our transaction topic

8367
06:06:29,118 --> 06:06:32,400
and you can verify this
using the console consumer.

8368
06:06:32,400 --> 06:06:33,145
So this is

8369
06:06:33,145 --> 06:06:36,500
where the developers use
the console consumer.

8370
06:06:38,000 --> 06:06:40,980
So now we have successfully
verified our producer.

8371
06:06:40,980 --> 06:06:43,900
So let me quickly go ahead
and stop the producer.

8372
06:06:45,500 --> 06:06:48,200
Let me stop the
consumer as well.

8373
06:06:49,400 --> 06:06:51,370
Let's quickly minimize this

8374
06:06:51,370 --> 06:06:54,144
and now let's go
to the second project.

8375
06:06:54,144 --> 06:06:56,700
That is Spark
Streaming Kafka Master.

8376
06:06:56,900 --> 06:06:57,200
Again.

8377
06:06:57,200 --> 06:06:59,667
We have specified
all the dependencies

8378
06:06:59,667 --> 06:07:00,800
that are required.

8379
06:07:01,000 --> 06:07:03,700
Let me quickly show
you those dependencies.

8380
06:07:07,700 --> 06:07:09,800
Now again, you
can see over here,

8381
06:07:09,800 --> 06:07:12,400
We have specified
Java version then we

8382
06:07:12,400 --> 06:07:16,600
have specified the artifacts
or you can see the dependencies.

8383
06:07:16,796 --> 06:07:18,796
So we have Scala compiler.

8384
06:07:18,796 --> 06:07:21,411
Then we have
spark streaming Kafka.

8385
06:07:21,900 --> 06:07:24,200
Then we have
Kafka clients.

8386
06:07:24,400 --> 06:07:28,400
Then JSON data binding, then we
have the Maven compiler plug-in.

8387
06:07:28,400 --> 06:07:30,600
So all those dependencies
which are required.

8388
06:07:30,600 --> 06:07:32,300
We have specified over here.

8389
06:07:32,500 --> 06:07:35,500
So let me quickly go
ahead and close it.

8390
06:07:36,200 --> 06:07:40,503
Let's quickly move to the source
directory main then let's look

8391
06:07:40,503 --> 06:07:42,100
at the resources again.

8392
06:07:42,203 --> 06:07:44,896
So this is the
application.yml file.

8393
06:07:45,700 --> 06:07:46,700
So we have port

8394
06:07:46,700 --> 06:07:49,600
8080, then we
have bootstrap server over here.

8395
06:07:49,600 --> 06:07:51,100
Then we have broker over here.

8396
06:07:51,100 --> 06:07:53,200
Then we have the topic,
which is transaction.

8397
06:07:53,200 --> 06:07:56,000
The group is transaction
partition count is one

8398
06:07:56,000 --> 06:07:57,273
and then the file name

8399
06:07:57,273 --> 06:07:59,664
so we won't be using
this file name then.

8400
06:07:59,664 --> 06:08:01,900
Let me quickly go ahead
and close this.

8401
06:08:01,900 --> 06:08:02,984
Let's go back.

8402
06:08:02,984 --> 06:08:06,600
Let's go back to the java
directory, com.spark.demo,

8403
06:08:06,600 --> 06:08:08,200
then this is the model.

8404
06:08:08,200 --> 06:08:10,100
So it's same

8405
06:08:10,600 --> 06:08:13,011
so these are all the fields
that are there

8406
06:08:13,011 --> 06:08:15,800
in the transaction:
you have transaction

8407
06:08:15,800 --> 06:08:18,100
date, product, price, payment type,

8408
06:08:18,100 --> 06:08:22,500
then name, city, state, country,
account created, and so on.

8409
06:08:22,500 --> 06:08:25,100
And again, we have
specified all the getter

8410
06:08:25,100 --> 06:08:29,285
and Setter methods over here
and similarly again,

8411
06:08:29,285 --> 06:08:32,600
we have created
this TransactionDTO constructor,

8412
06:08:32,600 --> 06:08:34,900
where we have taken
all the parameters

8413
06:08:34,900 --> 06:08:38,200
and then we are setting
the values using the this operator.

8414
06:08:38,200 --> 06:08:39,100
Next.

8415
06:08:39,100 --> 06:08:42,400
We are again overriding
this toString function,

8416
06:08:42,400 --> 06:08:43,414
and over here.

8417
06:08:43,414 --> 06:08:47,500
We are again returning the
details like transaction date

8418
06:08:47,500 --> 06:08:49,700
and then the value of
transaction date, product,

8419
06:08:49,700 --> 06:08:53,200
and then value of product
and similarly all the fields.

8420
06:08:53,411 --> 06:08:55,488
So let me close this model.

8421
06:08:55,900 --> 06:08:57,100
Let's go back.

8422
06:08:57,200 --> 06:09:00,500
Let's look at kafka,
then we have the serializer.

8423
06:09:00,500 --> 06:09:02,294
So this is the JSON serializer

8424
06:09:02,294 --> 06:09:06,187
which was there in our producer
and this is transaction decoder.

8425
06:09:06,187 --> 06:09:07,300
Let's take a look.

8426
06:09:07,780 --> 06:09:09,319
Now you have decoder

8427
06:09:09,400 --> 06:09:12,600
which is again implementing
Decoder, and we're passing

8428
06:09:12,600 --> 06:09:14,800
this TransactionDTO. Then again,

8429
06:09:14,800 --> 06:09:17,339
you can see we have this
fromBytes method,

8430
06:09:17,339 --> 06:09:18,800
which we are overriding

8431
06:09:18,800 --> 06:09:22,022
and we are reading
the values using these bytes

8432
06:09:22,022 --> 06:09:24,600
and then the TransactionDTO
class. Again,

8433
06:09:24,600 --> 06:09:28,600
if it is failing to parse, we are
giving "JSON processing failed

8434
06:09:28,600 --> 06:09:29,799
for object this

8435
06:09:30,200 --> 06:09:31,573
and you can see we have

8436
06:09:31,573 --> 06:09:34,200
this TransactionDecoder
constructor over here.
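The decode step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the demo's Java decoder: the trimmed-down TransactionDTO fields and the from_bytes name are assumptions made for this example.

```python
import json
from dataclasses import dataclass


# Hypothetical, trimmed-down stand-in for the demo's TransactionDTO model.
@dataclass
class TransactionDTO:
    transaction_date: str
    product: str
    price: float
    payment_type: str


def from_bytes(raw: bytes) -> TransactionDTO:
    """Parse one message value (UTF-8 encoded JSON) into a TransactionDTO."""
    try:
        return TransactionDTO(**json.loads(raw.decode("utf-8")))
    except (UnicodeDecodeError, json.JSONDecodeError, TypeError) as exc:
        # Mirrors the "JSON processing failed for object" message in the demo.
        raise ValueError(f"JSON processing failed for object: {raw!r}") from exc
```

A well-formed message comes back as a typed object, while a malformed one surfaces as a single, clearly labelled error.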

8437
06:09:34,200 --> 06:09:37,200
So let me quickly
again close this file.

8438
06:09:37,200 --> 06:09:38,892
Let's quickly go back.

8439
06:09:39,400 --> 06:09:42,500
And now let's take a look
at the Spark Streaming app,

8440
06:09:42,500 --> 06:09:44,200
where basically the data

8441
06:09:44,200 --> 06:09:48,100
which the producer project
will be producing to Kafka

8442
06:09:48,100 --> 06:09:51,900
will be actually consumed by
spark streaming application.

8443
06:09:51,900 --> 06:09:55,071
So spark streaming will stream
the data in real time

8444
06:09:55,071 --> 06:09:57,000
and then will display the data.

8445
06:09:57,000 --> 06:09:59,600
So in this Spark
streaming application,

8446
06:09:59,600 --> 06:10:03,189
we are creating conf object
and then we are setting

8447
06:10:03,189 --> 06:10:05,900
the application name
as Kafka sandbox.

8448
06:10:05,900 --> 06:10:09,331
The master is local[*],
then we have the Java

8449
06:10:09,331 --> 06:10:13,100
Spark context. So here we
are specifying the Spark context,

8450
06:10:13,100 --> 06:10:16,700
and then next we are specifying
the Java streaming context.

8451
06:10:16,700 --> 06:10:18,500
So this object will basically

8452
06:10:18,500 --> 06:10:21,100
be used to take
the streaming data.

8453
06:10:21,100 --> 06:10:25,946
So we are passing this Java Spark
context over here as a parameter

8454
06:10:25,946 --> 06:10:29,900
and then we are specifying
the duration that is 2000.

8455
06:10:29,900 --> 06:10:30,200
Next.

8456
06:10:30,200 --> 06:10:32,600
We have Kafka parameters
so to connect

8457
06:10:32,600 --> 06:10:35,555
to Kafka you need
to specify these parameters.

8458
06:10:35,555 --> 06:10:37,100
So in Kafka parameters,

8459
06:10:37,100 --> 06:10:39,500
we are specifying
the metadata broker

8460
06:10:39,500 --> 06:10:44,292
list, that is localhost:9092,
then we have auto offset reset,

8461
06:10:44,292 --> 06:10:45,600
that is smallest.

8462
06:10:45,600 --> 06:10:49,200
Then in topics the name
of the topic from which we

8463
06:10:49,200 --> 06:10:53,300
will be consuming messages
is transaction. Next,

8464
06:10:53,300 --> 06:10:56,200
we're creating a Java
pair input DStream.

8465
06:10:56,200 --> 06:10:59,300
So basically this DStream
is a discretized stream,

8466
06:10:59,300 --> 06:11:02,300
which is the basic abstraction
of spark streaming

8467
06:11:02,300 --> 06:11:04,290
and is a continuous sequence

8468
06:11:04,290 --> 06:11:07,104
of rdds representing
a continuous stream

8469
06:11:07,104 --> 06:11:11,200
of data. Now the DStream can either
be created from live data

8470
06:11:11,200 --> 06:11:13,000
from Kafka, HDFS, or Flume,

8471
06:11:13,000 --> 06:11:14,457
or it can be generated

8472
06:11:14,457 --> 06:11:17,900
by transforming existing
DStreams using operations.

8473
06:11:17,900 --> 06:11:18,828
So over here,

8474
06:11:18,828 --> 06:11:21,700
We are again creating
a Java input DStream.

8475
06:11:21,700 --> 06:11:24,700
We are passing String
and TransactionDTO as parameters

8476
06:11:24,700 --> 06:11:27,504
and we are creating
direct Kafka stream object.

8477
06:11:27,504 --> 06:11:29,700
Then we're using
this KafkaUtils

8478
06:11:29,700 --> 06:11:33,000
and we are calling
the method createDirectStream,

8479
06:11:33,000 --> 06:11:35,885
where we are passing
the parameters as SSC

8480
06:11:35,885 --> 06:11:38,700
that is your spark
streaming context then

8481
06:11:38,700 --> 06:11:40,341
you have String.class,

8482
06:11:40,341 --> 06:11:42,829
which is basically
your key class.

8483
06:11:42,829 --> 06:11:45,322
Then
TransactionDTO.class,

8484
06:11:45,322 --> 06:11:46,500
that is basically

8485
06:11:46,500 --> 06:11:49,700
your value class,
then StringDecoder,

8486
06:11:49,700 --> 06:11:52,868
that is to decode your key
and then TransactionDecoder,

8487
06:11:52,868 --> 06:11:55,900
basically to
decode your transaction.

8488
06:11:55,900 --> 06:11:57,784
Then you have Kafka parameters,

8489
06:11:57,784 --> 06:11:59,501
which you have created here

8490
06:11:59,501 --> 06:12:02,300
where you have specified
broker list and auto

8491
06:12:02,300 --> 06:12:05,900
offset reset and then you
are specifying the topics

8492
06:12:05,900 --> 06:12:10,500
which is your transaction. So
next, using this DStream,

8493
06:12:10,500 --> 06:12:14,000
you're actually continuously
iterating over the rdd

8494
06:12:14,000 --> 06:12:17,345
and then you are trying
to print your new RDD

8495
06:12:17,345 --> 06:12:19,400
with the RDD partition

8496
06:12:19,400 --> 06:12:21,200
size, then the RDD count,

8497
06:12:21,200 --> 06:12:24,600
and the records, so RDD
for each record.

8498
06:12:24,900 --> 06:12:26,400
So you are printing the record

8499
06:12:26,500 --> 06:12:30,400
and then you are starting
the Spark Streaming context

8500
06:12:30,400 --> 06:12:32,800
and then you are waiting
for the termination.

8501
06:12:32,800 --> 06:12:35,500
So this is the spark
streaming application.

8502
06:12:35,500 --> 06:12:39,200
So let's first quickly go ahead
and execute this application.

8503
06:12:39,200 --> 06:12:40,900
Let me close this file.

8504
06:12:41,000 --> 06:12:43,400
Let's go to the source.

8505
06:12:44,900 --> 06:12:49,000
Now, let me quickly go ahead and
delete this target directory.

8506
06:12:49,000 --> 06:12:53,615
So now let me quickly open the
terminal: mvn clean install.

8507
06:12:58,400 --> 06:13:01,800
So now as you can see the target
directory is again created

8508
06:13:01,800 --> 06:13:05,307
and this Spark Streaming Kafka
snapshot JAR is created.

8509
06:13:05,307 --> 06:13:07,300
So we need to execute this jar.

8510
06:13:07,700 --> 06:13:10,800
So let me quickly go ahead
and minimize it.

8511
06:13:12,500 --> 06:13:14,300
Let me close this terminal.

8512
06:13:14,400 --> 06:13:18,000
So now first I'll start
this Spark Streaming job.

8513
06:13:18,600 --> 06:13:24,100
So the command is java
-jar, and inside the target directory

8514
06:13:24,600 --> 06:13:31,500
we have this Spark Streaming
Kafka JAR, so let's hit enter.

8515
06:13:34,500 --> 06:13:38,100
So let me now quickly go ahead
and start producing messages.

8516
06:13:41,000 --> 06:13:44,100
So I will minimize this and I
will wait for the messages.

8517
06:13:50,019 --> 06:13:53,480
So let me quickly close
this Spark Streaming job

8518
06:13:53,600 --> 06:13:56,900
and then I will show
you the consumed records

8519
06:13:59,000 --> 06:14:00,400
so you can see the record

8520
06:14:00,400 --> 06:14:02,673
that is consumed
from spark streaming.

8521
06:14:02,673 --> 06:14:05,500
So here you have got record
and transaction dto

8522
06:14:05,500 --> 06:14:08,561
and then transaction date
products all the details,

8523
06:14:08,561 --> 06:14:09,969
which we have specified.

8524
06:14:09,969 --> 06:14:11,500
You can see it over here.

8525
06:14:11,500 --> 06:14:15,400
So this is how spark
streaming works with Kafka now,

8526
06:14:15,400 --> 06:14:17,600
it's just a basic job again.

8527
06:14:17,600 --> 06:14:20,900
You can go ahead and you
can take those transactions, you

8528
06:14:20,900 --> 06:14:23,651
can perform some real-time
analytics over there

8529
06:14:23,651 --> 06:14:27,406
and then you can go ahead and
write those results so over here

8530
06:14:27,406 --> 06:14:29,500
we have just given
you a basic demo

8531
06:14:29,500 --> 06:14:32,401
in which we are producing
the records to Kafka

8532
06:14:32,401 --> 06:14:34,400
and then using spark streaming.

8533
06:14:34,400 --> 06:14:37,533
We are streaming those records
from Kafka again.

8534
06:14:37,533 --> 06:14:38,600
You can go ahead

8535
06:14:38,600 --> 06:14:41,083
and you can perform
multiple Transformations

8536
06:14:41,083 --> 06:14:42,848
over the data multiple actions

8537
06:14:42,848 --> 06:14:45,500
and produce some real-time
results using this data.

8538
06:14:45,500 --> 06:14:48,975
So this is just a basic demo
where we have shown you

8539
06:14:48,975 --> 06:14:51,700
how to basically
produce records to Kafka

8540
06:14:51,700 --> 06:14:55,000
and then consume those records
using spark streaming.

8541
06:14:55,000 --> 06:14:57,846
So let's quickly go
back to our slide.

8542
06:14:58,600 --> 06:15:00,526
Now, as this was a basic project,

8543
06:15:00,526 --> 06:15:01,669
let me explain to you

8544
06:15:01,669 --> 06:15:04,390
one of the Kafka
Spark Streaming projects,

8545
06:15:04,390 --> 06:15:05,754
which is at Edureka.

8546
06:15:05,754 --> 06:15:09,100
So basically there is a company
called TechReview.com.

8547
06:15:09,100 --> 06:15:11,900
So this TechReview.com
basically provides reviews

8548
06:15:11,900 --> 06:15:14,481
for your recent
and different Technologies,

8549
06:15:14,481 --> 06:15:17,800
like smart watches, phones,
different operating systems

8550
06:15:17,800 --> 06:15:20,100
and anything new
that is coming into Market.

8551
06:15:20,100 --> 06:15:23,409
So what happens is the company
decided to include a new feature

8552
06:15:23,409 --> 06:15:26,883
which will basically allow
users to compare the popularity

8553
06:15:26,883 --> 06:15:29,200
or trend of multiple
Technologies based

8554
06:15:29,200 --> 06:15:32,400
on the Twitter feeds
and second, for the USP,

8555
06:15:32,400 --> 06:15:33,500
they basically

8556
06:15:33,500 --> 06:15:36,200
want this comparison
to happen in real time.

8557
06:15:36,200 --> 06:15:38,788
So basically they
have assigned you this task

8558
06:15:38,788 --> 06:15:41,299
so that you have to go
ahead you have to take

8559
06:15:41,299 --> 06:15:42,752
the real-time Twitter feeds

8560
06:15:42,752 --> 06:15:45,400
then you have to show
the real time comparison

8561
06:15:45,400 --> 06:15:46,900
of various Technologies.

8562
06:15:46,900 --> 06:15:50,500
So again, the company is
asking you to identify

8563
06:15:50,500 --> 06:15:51,684
the minute-level trend

8564
06:15:51,684 --> 06:15:55,500
between different Technologies
by consuming Twitter streams

8565
06:15:55,500 --> 06:15:58,900
and writing aggregated
minute-level counts to Cassandra,

8566
06:15:58,900 --> 06:16:00,200
from where again the

8567
06:16:00,200 --> 06:16:02,700
dashboarding team will come
into picture and then they

8568
06:16:02,700 --> 06:16:06,700
will try to dashboard that data
and it can show you a graph

8569
06:16:06,700 --> 06:16:07,800
where you can see

8570
06:16:07,800 --> 06:16:09,892
how the trend of two different

8571
06:16:09,892 --> 06:16:13,656
or you can see various
technologies are going ahead. Now,

8572
06:16:13,656 --> 06:16:16,157
the solution strategy
which is there

8573
06:16:16,157 --> 06:16:20,083
so you have to continuously
stream the data from Twitter.

8574
06:16:20,083 --> 06:16:21,689
Then you will be storing

8575
06:16:21,689 --> 06:16:24,322
those tweets
inside a Kafka topic,

8576
06:16:24,322 --> 06:16:25,567
then second again.

8577
06:16:25,567 --> 06:16:27,987
You have to
perform spark streaming.

8578
06:16:27,987 --> 06:16:31,009
So you will be continuously
streaming the data

8579
06:16:31,009 --> 06:16:34,300
and then you will be
applying some Transformations

8580
06:16:34,300 --> 06:16:36,900
which will basically
give you the minute trend

8581
06:16:36,900 --> 06:16:38,361
of the two technologies.

8582
06:16:38,361 --> 06:16:41,747
And again, you'll write it back
to a Kafka topic, and at last

8583
06:16:41,747 --> 06:16:42,992
you'll write a consumer

8584
06:16:42,992 --> 06:16:46,051
that will be consuming messages
from the Kafka topic

8585
06:16:46,051 --> 06:16:49,200
and that will write the data
in your Cassandra database.

8586
06:16:49,200 --> 06:16:51,018
So first you have
to write a program

8587
06:16:51,018 --> 06:16:53,049
that will be consuming
data from Twitter

8588
06:16:53,049 --> 06:16:54,696
and writing it to a Kafka topic.

8589
06:16:54,696 --> 06:16:56,999
Then you have to write
a spark streaming job,

8590
06:16:56,999 --> 06:17:00,200
which will be continuously
streaming the data from Kafka

8591
06:17:00,300 --> 06:17:03,300
and perform analytics
to identify the minute-level trend,

8592
06:17:03,300 --> 06:17:06,200
and then it will write the data
back to a Kafka topic,

8593
06:17:06,200 --> 06:17:08,282
and then you have
to write the third job

8594
06:17:08,282 --> 06:17:10,114
which will be
basically a consumer

8595
06:17:10,114 --> 06:17:12,668
that will consume data
from the Kafka topic

8596
06:17:12,668 --> 06:17:15,000
and write the data
to a Cassandra database.
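The minute-level aggregation in the middle of this pipeline can be sketched in plain Python, leaving out Twitter, Kafka, Spark Streaming, and Cassandra entirely. The technology names, the sample tweets, and the minute_counts helper below are all assumptions made for illustration; a real job would apply the same grouping to a stream.

```python
from collections import Counter
from datetime import datetime

# Hypothetical list of technologies to compare.
TECHNOLOGIES = ("android", "ios")


def minute_counts(tweets):
    """Count mentions per (minute, technology).

    `tweets` is an iterable of (timestamp, text) pairs, where timestamp is a datetime.
    """
    counts = Counter()
    for ts, text in tweets:
        # Truncate the timestamp to the start of its minute.
        minute = ts.replace(second=0, microsecond=0)
        lowered = text.lower()
        for tech in TECHNOLOGIES:
            if tech in lowered:
                counts[(minute, tech)] += 1
    return counts


# Made-up sample data standing in for the tweet stream.
sample = [
    (datetime(2019, 1, 1, 10, 0, 5), "Loving the new Android update"),
    (datetime(2019, 1, 1, 10, 0, 40), "iOS or Android?"),
    (datetime(2019, 1, 1, 10, 1, 10), "iOS beta is out"),
]

for (minute, tech), n in sorted(minute_counts(sample).items()):
    print(minute.strftime("%H:%M"), tech, n)
```

Each (minute, technology, count) row is the kind of record the final consumer would write to the database for the dashboard.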

8597
06:17:19,800 --> 06:17:21,709
Apache Spark is
a powerful framework,

8598
06:17:21,709 --> 06:17:23,960
which has been heavily
used in the industry

8599
06:17:23,960 --> 06:17:26,800
for real-time analytics
and machine learning purposes.

8600
06:17:26,800 --> 06:17:28,689
So before I proceed
with the session,

8601
06:17:28,689 --> 06:17:30,489
let's have a quick
look at the topics

8602
06:17:30,489 --> 06:17:31,968
which we'll be covering today.

8603
06:17:31,968 --> 06:17:33,600
So I'm starting
off by explaining

8604
06:17:33,600 --> 06:17:35,900
what exactly PySpark is
and how it works.

8605
06:17:35,900 --> 06:17:36,900
When we go ahead,

8606
06:17:36,900 --> 06:17:39,819
we'll find out the various
advantages provided by spark.

8607
06:17:39,819 --> 06:17:41,200
Then I will be showing you

8608
06:17:41,200 --> 06:17:43,400
how to install
PySpark in your systems.

8609
06:17:43,400 --> 06:17:45,300
Once we are done
with the installation.

8610
06:17:45,300 --> 06:17:48,200
I will talk about the
fundamental concepts of PySpark,

8611
06:17:48,200 --> 06:17:49,800
like the SparkContext,

8612
06:17:49,900 --> 06:17:53,900
DataFrames, MLlib, RDDs,
and much more. And finally,

8613
06:17:53,900 --> 06:17:57,100
I'll close off the session with
the demo in which I'll show you

8614
06:17:57,100 --> 06:18:00,200
how to implement PySpark
to solve real life use cases.

8615
06:18:00,200 --> 06:18:01,791
So without any further Ado,

8616
06:18:01,791 --> 06:18:04,621
let's quickly embark
on our journey to PySpark. Now,

8617
06:18:04,621 --> 06:18:06,558
before I start off
with PySpark,

8618
06:18:06,558 --> 06:18:09,500
let me first brief you
about the PySpark ecosystem.

8619
06:18:09,500 --> 06:18:13,154
As you can see from the diagram,
the Spark ecosystem is composed

8620
06:18:13,154 --> 06:18:16,400
of various components like
Spark SQL, Spark Streaming,

8621
06:18:16,400 --> 06:18:19,800
MLlib, GraphX, and the core
API component. The Spark

8622
06:18:19,800 --> 06:18:22,000
SQL component is used
to leverage the power

8623
06:18:22,000 --> 06:18:23,320
of declarative queries

8624
06:18:23,320 --> 06:18:26,281
and optimized storage
by executing SQL-like queries

8625
06:18:26,281 --> 06:18:27,124
on spark data,

8626
06:18:27,124 --> 06:18:28,654
which is present in RDDs

8627
06:18:28,654 --> 06:18:31,589
and other external sources. The
Spark Streaming component

8628
06:18:31,589 --> 06:18:33,882
allows developers
to perform batch processing

8629
06:18:33,882 --> 06:18:36,714
and streaming of data with ease
in the same application.

8630
06:18:36,714 --> 06:18:39,345
The machine learning library
eases the development

8631
06:18:39,345 --> 06:18:41,600
and deployment of
scalable machine learning

8632
06:18:41,600 --> 06:18:43,600
pipelines. The GraphX component

8633
06:18:43,600 --> 06:18:47,100
lets the data scientists work
with graph and non-graph sources

8634
06:18:47,100 --> 06:18:49,982
to achieve flexibility
and resilience in graph

8635
06:18:49,982 --> 06:18:51,775
construction and transformations.

8636
06:18:51,775 --> 06:18:54,000
And finally, the
Spark Core component.

8637
06:18:54,000 --> 06:18:56,723
It is the most vital component
of spark ecosystem,

8638
06:18:56,723 --> 06:18:57,900
which is responsible

8639
06:18:57,900 --> 06:19:00,644
for basic input/output
functions, scheduling,

8640
06:19:00,644 --> 06:19:04,172
and monitoring. The entire
Spark ecosystem is built on top

8641
06:19:04,172 --> 06:19:06,014
of this core execution engine,

8642
06:19:06,014 --> 06:19:09,000
which has extensible apis
in different languages

8643
06:19:09,000 --> 06:19:12,300
like Scala, Python, and Java.
And in today's session,

8644
06:19:12,300 --> 06:19:13,915
I will specifically discuss

8645
06:19:13,915 --> 06:19:16,967
about the spark API
in the Python programming language,

8646
06:19:16,967 --> 06:19:19,600
which is more popularly
known as PySpark.

8647
06:19:19,700 --> 06:19:22,839
You might be wondering
why PySpark. Well, to get

8648
06:19:22,839 --> 06:19:24,000
a better Insight.

8649
06:19:24,000 --> 06:19:26,400
Let me give you a brief
intro to PySpark.

8650
06:19:26,400 --> 06:19:29,300
Now as you already know
PySpark is the collaboration

8651
06:19:29,300 --> 06:19:31,050
of two powerful Technologies,

8652
06:19:31,050 --> 06:19:32,500
which are spark which is

8653
06:19:32,500 --> 06:19:35,459
an open-source cluster
computing framework built

8654
06:19:35,459 --> 06:19:38,300
around speed ease of use
and streaming analytics.

8655
06:19:38,300 --> 06:19:40,707
And the other one is python,
of course python,

8656
06:19:40,707 --> 06:19:43,900
which is a general purpose
high-level programming language.

8657
06:19:43,900 --> 06:19:46,900
It provides wide range
of libraries and is majorly used

8658
06:19:46,900 --> 06:19:50,000
for machine learning
and real-time analytics.

8659
06:19:50,000 --> 06:19:52,000
Now, this gives us PySpark,

8660
06:19:52,000 --> 06:19:53,852
which is a Python
API for Spark

8661
06:19:53,852 --> 06:19:56,581
that lets you harness
the Simplicity of Python

8662
06:19:56,581 --> 06:19:58,400
and the power of Apache Spark

8663
06:19:58,400 --> 06:20:01,059
in order to tame
big data. PySpark

8664
06:20:01,059 --> 06:20:03,398
also lets you use
the RDDs and comes

8665
06:20:03,398 --> 06:20:06,700
with a default integration
of the Py4J library.

8666
06:20:06,700 --> 06:20:10,397
We'll learn about RDDs later
in this video. Now that you know

8667
06:20:10,397 --> 06:20:11,500
what PySpark is,

8668
06:20:11,500 --> 06:20:14,400
let's now see the advantages
of using Spark with Python.

8669
06:20:14,400 --> 06:20:17,700
As we all know, Python
itself is very simple and easy.

8670
06:20:17,700 --> 06:20:20,700
So when Spark code is written
in Python, it makes PySpark

8671
06:20:20,700 --> 06:20:22,837
quite easy to learn
and use. Moreover,

8672
06:20:22,837 --> 06:20:24,737
it's a dynamically typed language,

8673
06:20:24,737 --> 06:20:28,300
which means RDDs can hold
objects of multiple data types.

8674
06:20:28,300 --> 06:20:30,711
Not only this, it also
makes the API simple

8675
06:20:30,711 --> 06:20:32,400
and comprehensive. And talking

8676
06:20:32,400 --> 06:20:34,700
about the readability
of code, maintenance,

8677
06:20:34,700 --> 06:20:36,700
and familiarity, the
Python API

8678
06:20:36,700 --> 06:20:38,577
for Apache Spark is far better

8679
06:20:38,577 --> 06:20:41,000
than other programming
languages. Python also

8680
06:20:41,000 --> 06:20:43,100
provides various options
for visualization,

8681
06:20:43,100 --> 06:20:46,180
which is not possible using
Scala or Java. Moreover,

8682
06:20:46,180 --> 06:20:49,200
you can conveniently call
R directly from Python.

8683
06:20:49,200 --> 06:20:50,800
On top of this, Python comes

8684
06:20:50,800 --> 06:20:52,300
with a wide range of libraries

8685
06:20:52,300 --> 06:20:55,800
like NumPy, pandas,
scikit-learn, Seaborn, Matplotlib,

8686
06:20:55,800 --> 06:20:57,912
and these libraries aid
in data analysis

8687
06:20:57,912 --> 06:20:59,300
and also provide mature

8688
06:20:59,300 --> 06:21:02,564
and time-tested statistics.
With all these features,

8689
06:21:02,564 --> 06:21:04,100
you can effortlessly program

8690
06:21:04,100 --> 06:21:06,700
in PySpark. In case
you get stuck somewhere

8691
06:21:06,700 --> 06:21:07,600
or have a doubt,

8692
06:21:07,600 --> 06:21:08,835
there is a huge PySpark

8693
06:21:08,835 --> 06:21:12,600
community out there whom you
can reach out to and put your query,

8694
06:21:12,600 --> 06:21:13,800
and that is very active.

8695
06:21:13,800 --> 06:21:16,647
So I will make good use
of this opportunity to show you

8696
06:21:16,647 --> 06:21:18,000
how to install PySpark

8697
06:21:18,000 --> 06:21:20,900
in a system. Now, here
I'm using a Red Hat Linux

8698
06:21:20,900 --> 06:21:24,400
based CentOS system;
the same steps can be applied

8699
06:21:24,400 --> 06:21:26,000
for other Linux systems as well.

8700
06:21:26,200 --> 06:21:28,500
So in order to install
PySpark, first

8701
06:21:28,500 --> 06:21:31,100
make sure that you have
Hadoop installed in your system.

8702
06:21:31,100 --> 06:21:33,700
So if you want to know more
about how to install Hadoop,

8703
06:21:33,700 --> 06:21:36,500
please check out
our new playlist on YouTube

8704
06:21:36,500 --> 06:21:39,909
or you can check out our blog on
the Edureka website. First

8705
06:21:39,909 --> 06:21:43,100
of all, you need to go to the
Apache Spark official website,

8706
06:21:43,100 --> 06:21:44,750
which is
spark.apache.org,

8707
06:21:44,750 --> 06:21:48,025
and in the download section you
can download the latest version

8708
06:21:48,025 --> 06:21:48,907
of spark release

8709
06:21:48,907 --> 06:21:51,500
which supports
the latest version of Hadoop

8710
06:21:51,500 --> 06:21:53,800
or Hadoop version
2.7 or above now.

8711
06:21:53,800 --> 06:21:55,429
Once you have downloaded it,

8712
06:21:55,429 --> 06:21:57,900
all you need to do is
extract it, or rather say,

8713
06:21:57,900 --> 06:21:59,400
untar the file contents.

8714
06:21:59,400 --> 06:22:01,400
And after that you
need to put in the path

8715
06:22:01,400 --> 06:22:04,200
where Spark is installed
in the .bashrc file.

8716
06:22:04,200 --> 06:22:06,082
Now, you also need
to install pip

8717
06:22:06,082 --> 06:22:09,300
and Jupyter Notebook using
the pip command, and make sure

8718
06:22:09,300 --> 06:22:11,700
that the version
of pip is 10 or above. So,

8719
06:22:11,700 --> 06:22:12,820
as you can see here,

8720
06:22:12,820 --> 06:22:16,114
this is what our .bashrc file
looks like. Here you can see

8721
06:22:16,114 --> 06:22:17,700
that we have put in the path

8722
06:22:17,700 --> 06:22:20,700
for Hadoop, Spark, and as
well as the PySpark driver Python,

8723
06:22:20,700 --> 06:22:22,200
which is the Jupyter Notebook.

8724
06:22:22,200 --> 06:22:23,087
What it'll do

8725
06:22:23,087 --> 06:22:25,939
is that the moment you
run the PySpark shell,

8726
06:22:25,939 --> 06:22:29,300
it will automatically open
a jupyter notebook for you.

8727
06:22:29,300 --> 06:22:29,551
Now.

8728
06:22:29,551 --> 06:22:32,000
I find jupyter notebook
very easy to work

8729
06:22:32,000 --> 06:22:35,700
with rather than the shell;
it's a personal choice. Now

8730
06:22:35,700 --> 06:22:37,899
that we are done
with the installation part,

8731
06:22:37,899 --> 06:22:40,100
let's now dive deeper
into PySpark and learn a few

8732
06:22:40,100 --> 06:22:41,100
of its fundamentals,

8733
06:22:41,100 --> 06:22:43,770
which you need to know
in order to work with PySpark.

8734
06:22:43,770 --> 06:22:45,870
Now this timeline shows
the various topics,

8735
06:22:45,870 --> 06:22:48,600
which we will be covering under
the PySpark fundamentals.

8736
06:22:48,700 --> 06:22:49,650
So let's start off

8737
06:22:49,650 --> 06:22:51,500
with the very first
topic in our list,

8738
06:22:51,500 --> 06:22:53,100
that is, the SparkContext.

8739
06:22:53,100 --> 06:22:56,335
The spark context is the heart
of any spark application.

8740
06:22:56,335 --> 06:22:59,518
It sets up internal services
and establishes a connection

8741
06:22:59,518 --> 06:23:03,300
to a Spark execution environment.
Through a SparkContext object,

8742
06:23:03,300 --> 06:23:05,357
you can create RDDs, accumulators,

8743
06:23:05,357 --> 06:23:09,000
and broadcast variables,
access Spark services, run jobs,

8744
06:23:09,000 --> 06:23:11,362
and much more.
The SparkContext allows

8745
06:23:11,362 --> 06:23:14,094
the spark driver application
to access the cluster

8746
06:23:14,094 --> 06:23:15,600
through a resource manager,

8747
06:23:15,600 --> 06:23:16,600
which can be yarn

8748
06:23:16,600 --> 06:23:19,600
or Sparks cluster manager
the driver program then runs.

8749
06:23:19,700 --> 06:23:23,044
Operations inside the executors
on the worker nodes

8750
06:23:23,044 --> 06:23:26,478
and Spark context uses the pie
for Jay to launch a jvm

8751
06:23:26,478 --> 06:23:29,200
which in turn creates
a Java spark context.

8752
06:23:29,200 --> 06:23:30,884
Now there are
various parameters,

8753
06:23:30,884 --> 06:23:33,200
which can be used
with a SparkContext object,

8754
06:23:33,200 --> 06:23:34,700
like the master, app name,

8755
06:23:34,700 --> 06:23:37,366
Spark home, the py
files, the environment,

8756
06:23:37,366 --> 06:23:41,600
the batch size,
serializer, configuration, gateway,

8757
06:23:41,600 --> 06:23:44,267
and much more.
Among these parameters,

8758
06:23:44,267 --> 06:23:47,700
the master and app name
are the most commonly used. Now,

8759
06:23:47,700 --> 06:23:51,061
to give you a basic insight
into how a Spark program works,

8760
06:23:51,061 --> 06:23:53,807
I have listed down
its basic lifecycle phases.

8761
06:23:53,807 --> 06:23:56,903
The typical life cycle
of a Spark program includes

8762
06:23:56,903 --> 06:23:59,367
creating RDDs from
external data sources

8763
06:23:59,367 --> 06:24:02,400
or parallelizing a collection
in your driver program.

8764
06:24:02,400 --> 06:24:05,361
Then we have the lazy
transformations, that is, lazily

8765
06:24:05,361 --> 06:24:07,064
transforming the base RDDs

8766
06:24:07,064 --> 06:24:10,600
into new RDDs using
transformations, then caching a few

8767
06:24:10,600 --> 06:24:12,700
of those RDDs for future reuse,

8768
06:24:12,800 --> 06:24:15,800
and finally performing actions
to execute parallel computation

8769
06:24:15,800 --> 06:24:17,500
and to produce the results.

8770
06:24:17,500 --> 06:24:19,800
The next topic
in our list is RDD,

8771
06:24:19,800 --> 06:24:20,700
and I'm sure people

8772
06:24:20,700 --> 06:24:23,700
who have already worked with
Spark are familiar with this term,

8773
06:24:23,700 --> 06:24:25,582
but for people
who are new to it,

8774
06:24:25,582 --> 06:24:26,900
let me just explain it.

8775
06:24:26,900 --> 06:24:29,782
Now, RDD stands for
Resilient Distributed Dataset.

8776
06:24:29,782 --> 06:24:32,000
It is considered to be
the building block

8777
06:24:32,000 --> 06:24:33,433
of any spark application.

8778
06:24:33,433 --> 06:24:35,900
The reason behind this
is these elements run

8779
06:24:35,900 --> 06:24:38,600
and operate on multiple nodes
to do parallel processing

8780
06:24:38,600 --> 06:24:39,400
on a cluster.

8781
06:24:39,400 --> 06:24:40,952
And once you create an RDD,

8782
06:24:40,952 --> 06:24:43,273
it becomes immutable,
and by immutable

8783
06:24:43,273 --> 06:24:46,637
I mean that it is an object
whose state cannot be modified

8784
06:24:46,637 --> 06:24:47,700
after it's created,

8785
06:24:47,700 --> 06:24:49,654
but we can transform
its values by

8786
06:24:49,654 --> 06:24:51,438
applying certain transformations.

8787
06:24:51,438 --> 06:24:53,500
They have good
fault tolerance ability

8788
06:24:53,500 --> 06:24:56,700
and can automatically recover
from almost any failure.

8789
06:24:56,700 --> 06:25:00,700
This is an added advantage.
Now, to achieve a certain task,

8790
06:25:00,700 --> 06:25:03,205
multiple operations can
be applied on these RDDs,

8791
06:25:03,205 --> 06:25:05,675
which are categorized
in two ways: the first

8792
06:25:05,675 --> 06:25:06,800
is the transformations

8793
06:25:06,800 --> 06:25:09,900
and the second one is
the actions. The transformations

8794
06:25:09,900 --> 06:25:10,800
are the operations

8795
06:25:10,800 --> 06:25:13,800
which are applied on an RDD
to create a new RDD.

8796
06:25:14,000 --> 06:25:15,300
Now, these transformations work

8797
06:25:15,300 --> 06:25:17,300
on the principle
of lazy evaluation,

8798
06:25:17,700 --> 06:25:19,900
and transformations
are lazy in nature,

8799
06:25:19,900 --> 06:25:22,927
meaning when we call
some operation on an RDD,

8800
06:25:22,927 --> 06:25:25,758
it does not execute
immediately. Spark maintains

8801
06:25:25,758 --> 06:25:28,602
a record of the operations
being called

8802
06:25:28,602 --> 06:25:31,324
with the help
of directed acyclic graphs,

8803
06:25:31,324 --> 06:25:33,100
which are also known as DAGs.

8804
06:25:33,100 --> 06:25:35,900
And since the transformations
are lazy in nature,

8805
06:25:35,900 --> 06:25:37,604
we can execute operations

8806
06:25:37,604 --> 06:25:40,100
any time by calling
an action on the data.

8807
06:25:40,100 --> 06:25:42,371
With lazy evaluation,
data is not loaded

8808
06:25:42,371 --> 06:25:43,547
until it's necessary,

8809
06:25:43,547 --> 06:25:46,900
and the moment we call
the action, all the computations

8810
06:25:46,900 --> 06:25:49,900
are performed in parallel to give
you the desired output.
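
[Editor's sketch] To make the lazy-evaluation idea concrete, here is a minimal plain-Python illustration. It is not PySpark: the `LazyDataset` class and its method names are hypothetical stand-ins for an RDD, and the recorded list of operations is only a simplified analogue of Spark's DAG. Transformations are merely recorded, and nothing runs until an action (`collect`) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, data, ops=None):
        self._data = list(data)
        self._ops = ops or []  # recorded lineage of pending operations

    def map(self, fn):
        # Transformation: returns a new dataset; nothing is computed yet.
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded operations and return the result.
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def double(x):
    calls.append(x)  # side effect lets us observe when the work happens
    return x * 2


base = LazyDataset([1, 2, 3, 4])
pending = base.map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only the lineage was recorded
result = pending.collect()        # the action triggers the computation
```

Because each transformation returns a new object, `base` itself is never modified, which mirrors the immutability described earlier.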

8811
06:25:49,900 --> 06:25:52,400
Now, a few important
transformations are

8812
06:25:52,400 --> 06:25:53,944
map, flatMap, filter,

8813
06:25:53,944 --> 06:25:55,360
distinct, reduceByKey,

8814
06:25:55,360 --> 06:25:59,000
mapPartitions, and sortBy.
Actions are the operations

8815
06:25:59,000 --> 06:26:02,058
which are applied on an RDD
to instruct Apache Spark

8816
06:26:02,058 --> 06:26:03,188
to apply computation

8817
06:26:03,188 --> 06:26:05,600
and pass the result back
to the driver. A few

8818
06:26:05,600 --> 06:26:09,100
of these actions include
collect, collectAsMap, reduce,

8819
06:26:09,100 --> 06:26:10,300
take, and first. Now,

8820
06:26:10,300 --> 06:26:13,600
let me implement a few of these
for your better understanding.

8821
06:26:14,600 --> 06:26:17,000
So first of all,
let me show you the .bashrc

8822
06:26:17,000 --> 06:26:18,800
file, which I
was talking about.

8823
06:26:25,100 --> 06:26:27,196
So here you can see
in the .bashrc file,

8824
06:26:27,196 --> 06:26:29,400
we provide the path
for all the frameworks

8825
06:26:29,400 --> 06:26:31,250
which we have installed
in the system.

8826
06:26:31,250 --> 06:26:32,800
So for example,
you can see here

8827
06:26:32,800 --> 06:26:35,100
we have installed
Hadoop. The moment we

8828
06:26:35,100 --> 06:26:38,100
install and unzip it,
or rather say untar it,

8829
06:26:38,100 --> 06:26:41,300
I have shifted all my frameworks
to one particular location,

8830
06:26:41,300 --> 06:26:43,492
as you can see: it is
usr, the user directory,

8831
06:26:43,492 --> 06:26:46,140
and inside this we have
the library, and inside

8832
06:26:46,140 --> 06:26:49,217
that I have installed Hadoop
and also Spark. Now,

8833
06:26:49,217 --> 06:26:50,400
as you can see here,

8834
06:26:50,400 --> 06:26:51,300
we have two lines.

8835
06:26:51,300 --> 06:26:54,800
I'll highlight this one for
you: the PySpark driver

8836
06:26:54,800 --> 06:26:56,392
Python, which is Jupyter,

8837
06:26:56,392 --> 06:26:59,700
and we have given its
option as notebook.

8838
06:26:59,700 --> 06:27:02,100
What this will do
is that the moment

8839
06:27:02,100 --> 06:27:04,731
I start Spark, it will
automatically redirect me

8840
06:27:04,731 --> 06:27:06,200
to the Jupyter Notebook.

8841
06:27:10,200 --> 06:27:14,500
So let me just rename
this notebook as RDD tutorial.

8842
06:27:15,200 --> 06:27:16,900
So let's get started.

8843
06:27:17,800 --> 06:27:21,000
So here, to load any file
into an RDD, suppose

8844
06:27:21,000 --> 06:27:23,795
I'm loading a text file,
you need to use sc,

8845
06:27:23,795 --> 06:27:26,700
that is, the SparkContext:
sc.textFile,

8846
06:27:26,700 --> 06:27:28,952
and you need to provide
the path of the data

8847
06:27:28,952 --> 06:27:30,600
which you are going to load.

8848
06:27:30,600 --> 06:27:33,300
So one thing to keep
in mind is that the default path

8849
06:27:33,300 --> 06:27:35,483
which the RDD takes
or the Jupyter

8850
06:27:35,483 --> 06:27:37,365
Notebook takes is the HDFS path.

8851
06:27:37,365 --> 06:27:39,456
So in order to use
the local file system,

8852
06:27:39,456 --> 06:27:41,311
you need to mention
file, colon,

8853
06:27:41,311 --> 06:27:42,900
and double forward slashes.

8854
06:27:43,800 --> 06:27:46,100
So once our sample data is

8855
06:27:46,100 --> 06:27:49,076
inside the RDD, now to
have a look at it

8856
06:27:49,076 --> 06:27:52,000
we need to invoke
an action on it.

8857
06:27:52,000 --> 06:27:54,900
So let's go ahead and take
a look at the first five objects,

8858
06:27:54,900 --> 06:27:59,400
or rather say the first five
elements of this particular RDD.

8859
06:27:59,700 --> 06:28:02,776
The sample data I have taken
here is about blockchain,

8860
06:28:02,776 --> 06:28:03,700
as you can see.

8861
06:28:03,700 --> 06:28:05,000
We have one two,

8862
06:28:05,030 --> 06:28:07,569
three four and
five elements here.

8863
06:28:08,500 --> 06:28:12,080
Suppose I need to convert
all the data into lowercase

8864
06:28:12,080 --> 06:28:14,600
and split it
word by word.

8865
06:28:14,600 --> 06:28:17,000
So for that I will
create a function,

8866
06:28:17,000 --> 06:28:20,000
and in the function
I'll pass on this RDD.

8867
06:28:20,000 --> 06:28:21,700
So I'm creating,
as you can see here,

8868
06:28:21,700 --> 06:28:22,990
I'm creating rdd1,

8869
06:28:22,990 --> 06:28:25,700
that is, a new RDD,
using the map function,

8870
06:28:25,700 --> 06:28:29,200
or rather say the transformation,
and passing on the function

8871
06:28:29,200 --> 06:28:32,100
which I just created to lowercase
and to split it.

8872
06:28:32,496 --> 06:28:35,803
So if we have a look
at the output of rdd1,

8873
06:28:37,800 --> 06:28:39,059
as you can see here,

8874
06:28:39,059 --> 06:28:41,200
all the words are
in lowercase

8875
06:28:41,200 --> 06:28:44,300
and all of them are separated
with the help of a space.

8876
06:28:44,700 --> 06:28:47,000
Now there's another transformation,

8877
06:28:47,000 --> 06:28:50,216
which is known as flatMap,
to give you a flattened output,

8878
06:28:50,216 --> 06:28:52,157
and I am passing
the same function

8879
06:28:52,157 --> 06:28:53,569
which I created earlier.

8880
06:28:53,569 --> 06:28:54,500
So let's go ahead

8881
06:28:54,500 --> 06:28:56,800
and have a look
at the output for this one.

8882
06:28:56,800 --> 06:28:58,200
So as you can see here,

8883
06:28:58,200 --> 06:29:00,189
we got the first five elements

8884
06:29:00,189 --> 06:29:04,355
which are the same ones as we got
here: "contracts", "transactions",

8885
06:29:04,355 --> 06:29:05,700
"and", "the", and "records".
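
[Editor's sketch] The difference between map and flatMap can be shown without a Spark installation. The snippet below is a plain-Python analogue, not PySpark code; the two sample lines are made up for illustration and are not the tutorial's blockchain file. `lower_split` plays the role of the function passed to both transformations.

```python
from itertools import chain

# Made-up sample lines (an assumption, not the data used in the session).
lines = ["Smart contracts", "Transactions and the records"]


def lower_split(line):
    # Lowercase the line and split it word by word.
    return line.lower().split()


# map analogue: one output element (a list of words) per input line.
mapped = list(map(lower_split, lines))

# flatMap analogue: the per-line lists are flattened into one list of words.
flat_mapped = list(chain.from_iterable(map(lower_split, lines)))
```

In PySpark the corresponding calls would be `rdd.map(...)` and `rdd.flatMap(...)`, followed by an action such as `take(5)` to see the result.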

8886
06:29:05,700 --> 06:29:07,523
So just one thing
to keep in mind

8887
06:29:07,523 --> 06:29:09,700
is that flatMap
is a transformation,

8888
06:29:09,700 --> 06:29:11,664
whereas take is the action. Now,

8889
06:29:11,664 --> 06:29:13,614
as you can see,
the contents

8890
06:29:13,614 --> 06:29:16,007
of the sample data
contain stop words.

8891
06:29:16,007 --> 06:29:18,762
So if I want to remove
all the stop words, all you

8892
06:29:18,762 --> 06:29:19,900
need to do is start

8893
06:29:19,900 --> 06:29:23,351
and create a list of stop words,
which I have mentioned here,

8894
06:29:23,351 --> 06:29:24,200
as you can see.

8895
06:29:24,200 --> 06:29:26,200
We have "a", "all", "the", "as", "is",

8896
06:29:26,200 --> 06:29:28,700
and now, these are
not all the stop words.

8897
06:29:28,700 --> 06:29:31,701
So I've chosen only a few
of them just to show you

8898
06:29:31,701 --> 06:29:33,600
what exactly the output will be

8899
06:29:33,600 --> 06:29:36,100
and now we are using here
the filter transformation

8900
06:29:36,100 --> 06:29:37,800
with the help of a lambda

8901
06:29:37,800 --> 06:29:40,800
function, in which we have
specified x as: x not

8902
06:29:40,800 --> 06:29:43,360
in stopwords. And we
have created another RDD,

8903
06:29:43,360 --> 06:29:44,465
which is rdd3,

8904
06:29:44,465 --> 06:29:46,000
which will take the input

8905
06:29:46,000 --> 06:29:48,800
from rdd2. So
let's go ahead and see

8906
06:29:48,800 --> 06:29:51,700
whether "and" and "the"
are removed or not.

8907
06:29:51,700 --> 06:29:55,600
As you can see: "contracts",
"transactions", "records", "of", "them".

8908
06:29:55,600 --> 06:29:57,500
If you look at output 5,

8909
06:29:57,500 --> 06:30:00,979
we have "contracts", "transactions",
"and", and "the"; "and" and "the"

8910
06:30:00,979 --> 06:30:02,337
are not in this list,

8911
06:30:02,337 --> 06:30:04,600
but suppose I want
to group the data

8912
06:30:04,600 --> 06:30:07,523
according to the first
three characters of any element.

8913
06:30:07,523 --> 06:30:08,756
So for that I'll use

8914
06:30:08,756 --> 06:30:11,900
groupBy, and I'll use
the lambda function again.

8915
06:30:11,900 --> 06:30:14,000
So let's have a look
at the output

8916
06:30:14,000 --> 06:30:16,769
so you can see we
have "edge" and "edges".

8917
06:30:16,900 --> 06:30:20,638
So the first three letters of
both words are the same. Similarly,

8918
06:30:20,638 --> 06:30:23,300
we can group it using
the first two letters

8919
06:30:23,300 --> 06:30:27,800
also. Let me just change it
to two, so you can see we have "gu"

8920
06:30:27,800 --> 06:30:29,800
and "guide".
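
[Editor's sketch] The stop-word filter and the prefix grouping can be mirrored in plain Python. This is an analogue, not PySpark: the word list and the stop-word set are assumptions made for illustration, and `group_by_prefix` is a hypothetical helper standing in for `groupBy` with a lambda such as `lambda w: w[:3]`.

```python
from collections import defaultdict

# Assumed sample words and stop words (not the tutorial's actual data).
words = ["contracts", "transactions", "and", "the", "records",
         "edge", "edges", "guide"]
stopwords = {"a", "all", "the", "as", "is", "and"}

# filter analogue: keep only the words that are not stop words.
kept = [w for w in words if w not in stopwords]


def group_by_prefix(items, n):
    # groupBy analogue: the key is the first n characters of each word.
    groups = defaultdict(list)
    for w in items:
        groups[w[:n]].append(w)
    return dict(groups)


by_three = group_by_prefix(kept, 3)  # "edge" and "edges" share the key "edg"
by_two = group_by_prefix(kept, 2)    # "guide" falls under the key "gu"
```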

8921
06:30:30,000 --> 06:30:32,200
Now, these are
the basic transformations

8922
06:30:32,200 --> 06:30:33,785
and actions. But suppose

8923
06:30:33,785 --> 06:30:37,400
I want to find out the sum
of the first thousand numbers,

8924
06:30:37,400 --> 06:30:39,436
or rather the first
10,000 numbers.

8925
06:30:39,436 --> 06:30:42,300
All I need to do
is initialize another RDD,

8926
06:30:42,300 --> 06:30:44,400
which is the
number underscore RDD.

8927
06:30:44,400 --> 06:30:47,512
And we use sc.parallelize,
and the range

8928
06:30:47,512 --> 06:30:49,500
we have given is one to 10,000,

8929
06:30:49,500 --> 06:30:51,600
and we'll use the reduce action

8930
06:30:51,600 --> 06:30:54,532
here to see the output
you can see here.

8931
06:30:54,532 --> 06:30:56,840
We have the sum
of the numbers ranging

8932
06:30:56,840 --> 06:30:58,400
from one to ten thousand.
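
[Editor's sketch] The reduce action folds the elements pairwise with a two-argument function. A plain-Python analogue using `functools.reduce` is shown below; it assumes the range runs from 1 to 10,000 inclusive, which the session does not state explicitly. In PySpark the same lambda would be passed to the RDD's `reduce` after `sc.parallelize`.

```python
from functools import reduce

# Assumption: the range covers 1 through 10,000 inclusive.
numbers = range(1, 10001)

# reduce analogue: combine elements two at a time with an associative function.
total = reduce(lambda a, b: a + b, numbers)
```

The function should be associative and commutative, since Spark combines partial results from different partitions in no guaranteed order.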

8933
06:30:58,400 --> 06:31:00,900
Now this was all about rdd.

8934
06:31:00,900 --> 06:31:01,699
The next topic

8935
06:31:01,699 --> 06:31:03,711
that we have
on our list is broadcast

8936
06:31:03,711 --> 06:31:07,181
and accumulators. Now, in Spark
we perform parallel processing

8937
06:31:07,181 --> 06:31:09,100
through the help
of shared variables.

8938
06:31:09,100 --> 06:31:11,672
When the driver sends
any task to the executors

8939
06:31:11,672 --> 06:31:14,900
present on the cluster, a copy of
the shared variable is also sent

8940
06:31:14,900 --> 06:31:15,700
to each node

8941
06:31:15,700 --> 06:31:18,100
of the cluster, thus
maintaining high availability

8942
06:31:18,100 --> 06:31:19,400
and fault tolerance.

8943
06:31:19,400 --> 06:31:22,223
Now, this is done in order
to accomplish the task,

8944
06:31:22,223 --> 06:31:25,341
and Apache Spark supports
two types of shared variables.

8945
06:31:25,341 --> 06:31:26,711
One of them is broadcast.

8946
06:31:26,711 --> 06:31:28,861
And the other one is
the accumulator. Now,

8947
06:31:28,861 --> 06:31:31,735
broadcast variables are used
to save a copy of data

8948
06:31:31,735 --> 06:31:33,334
on all the nodes in a cluster,

8949
06:31:33,334 --> 06:31:36,117
whereas the accumulator is
the variable that is used

8950
06:31:36,117 --> 06:31:37,700
for aggregating the incoming

8951
06:31:37,700 --> 06:31:40,056
information via
different associative

8952
06:31:40,056 --> 06:31:43,500
and commutative operations. Now,
moving on to our next topic,

8953
06:31:43,500 --> 06:31:47,094
which is Spark configuration.
The SparkConf class

8954
06:31:47,094 --> 06:31:49,800
provides a set
of configurations and parameters

8955
06:31:49,800 --> 06:31:52,300
that are needed to execute
a spark application

8956
06:31:52,300 --> 06:31:54,300
on the local system
or any cluster.

8957
06:31:54,300 --> 06:31:56,800
Now when you use
a SparkConf object

8958
06:31:56,800 --> 06:31:59,112
to set the values
of these parameters,

8959
06:31:59,112 --> 06:32:02,800
they automatically take priority
over the system properties.

8960
06:32:02,800 --> 06:32:05,035
Now, this class
contains various getter

8961
06:32:05,035 --> 06:32:07,800
and setter methods, some
of which are the set method,

8962
06:32:07,800 --> 06:32:10,323
which is used to set
a configuration property.

8963
06:32:10,323 --> 06:32:11,555
We have setMaster,

8964
06:32:11,555 --> 06:32:13,605
which is used for setting
the master URL.

8965
06:32:13,605 --> 06:32:14,840
We have setAppName,

8966
06:32:14,840 --> 06:32:17,421
which is used to set
an application name, and we

8967
06:32:17,421 --> 06:32:20,900
have the get method to retrieve
the configuration value of a key.

8968
06:32:20,900 --> 06:32:23,000
And finally we
have setSparkHome,

8969
06:32:23,000 --> 06:32:25,600
which is used for setting
the Spark installation path

8970
06:32:25,600 --> 06:32:26,700
on worker nodes.

8971
06:32:26,700 --> 06:32:28,800
Now coming to the next
topic on our list

8972
06:32:28,800 --> 06:32:31,600
which is SparkFiles.
The SparkFiles class

8973
06:32:31,600 --> 06:32:33,264
contains only class methods,

8974
06:32:33,264 --> 06:32:36,500
so the user cannot create
any SparkFiles instance.

8975
06:32:36,500 --> 06:32:39,200
Now, this helps in resolving
the paths of the files

8976
06:32:39,200 --> 06:32:41,500
that are added using
the SparkContext addFile

8977
06:32:41,500 --> 06:32:44,600
method. The class SparkFiles
contains two class methods,

8978
06:32:44,600 --> 06:32:47,798
which are the get method and
the getRootDirectory method.

8979
06:32:47,798 --> 06:32:50,500
Now, the get is used
to retrieve the absolute path

8980
06:32:50,500 --> 06:32:53,900
of a file added through
SparkContext.addFile,

8981
06:32:54,000 --> 06:32:55,300
and getRootDirectory

8982
06:32:55,300 --> 06:32:57,076
is used to retrieve
the root directory

8983
06:32:57,076 --> 06:32:58,900
that contains the files
that are added

8984
06:32:58,900 --> 06:33:00,700
through SparkContext
dot addFile.

8985
06:33:00,700 --> 06:33:03,022
Now, these are small topics,
and the next topic

8986
06:33:03,022 --> 06:33:04,257
that we will be covering

8987
06:33:04,257 --> 06:33:07,600
in our list is DataFrames.
Now, a DataFrame in Apache

8988
06:33:07,600 --> 06:33:09,655
Spark is a distributed
collection of rows

8989
06:33:09,655 --> 06:33:10,831
under named columns,

8990
06:33:10,831 --> 06:33:13,400
which is similar to
the relational database tables

8991
06:33:13,400 --> 06:33:14,700
or Excel sheets.

8992
06:33:14,700 --> 06:33:16,812
It also shares common attributes

8993
06:33:16,812 --> 06:33:19,800
with RDDs. A few
characteristics of DataFrames:

8994
06:33:19,800 --> 06:33:21,300
they are immutable in nature.

8995
06:33:21,300 --> 06:33:23,500
That is to say,
you can create a DataFrame,

8996
06:33:23,500 --> 06:33:24,900
but you cannot change it.

8997
06:33:24,900 --> 06:33:26,500
It allows lazy evaluation,

8998
06:33:26,500 --> 06:33:28,300
that is, the task is not executed

8999
06:33:28,300 --> 06:33:30,500
unless and until
an action is triggered.

9000
06:33:30,500 --> 06:33:33,000
Moreover, DataFrames
are distributed in nature

9001
06:33:33,000 --> 06:33:34,900
and are designed
for processing large

9002
06:33:34,900 --> 06:33:37,400
collections of structured
or semi-structured data.

9003
06:33:37,400 --> 06:33:39,953
They can be created using
different data formats,

9004
06:33:39,953 --> 06:33:41,200
like loading the data

9005
06:33:41,200 --> 06:33:43,650
from source files
such as JSON or CSV,

9006
06:33:43,650 --> 06:33:46,100
or you can load it
from an existing RDD.

9007
06:33:46,100 --> 06:33:48,842
You can use databases
like Hive and Cassandra.

9008
06:33:48,842 --> 06:33:50,600
You can use Parquet files.

9009
06:33:50,600 --> 06:33:52,800
You can use CSV and XML files.

9010
06:33:52,800 --> 06:33:53,900
There are many sources

9011
06:33:53,900 --> 06:33:56,448
through which you can create
a particular RDD. Now,

9012
06:33:56,448 --> 06:33:59,200
let me show you how to create
a DataFrame in PySpark

9013
06:33:59,200 --> 06:34:02,100
and perform various actions
and Transformations on it.

9014
06:34:02,300 --> 06:34:05,065
So let's continue this
in the same notebook

9015
06:34:05,065 --> 06:34:07,700
which we have here. Now,
here we have taken

9016
06:34:07,700 --> 06:34:09,300
the NYC flight data,

9017
06:34:09,300 --> 06:34:12,561
and I'm creating a DataFrame
which is NYC flights

9018
06:34:12,561 --> 06:34:13,300
underscore

9019
06:34:13,300 --> 06:34:14,959
df. Now, to load the data,

9020
06:34:14,959 --> 06:34:18,340
we are using the spark dot
read dot csv method, and you need

9021
06:34:18,340 --> 06:34:19,600
to provide the path,

9022
06:34:19,600 --> 06:34:21,900
which is the local path,
as by default

9023
06:34:21,900 --> 06:34:24,200
it takes HDFS, same as the RDD,

9024
06:34:24,200 --> 06:34:26,208
and one thing
to note down here is

9025
06:34:26,208 --> 06:34:28,886
that I've provided
two extra parameters here,

9026
06:34:28,886 --> 06:34:31,400
which are inferSchema
and header.

9027
06:34:31,400 --> 06:34:34,700
If we do not provide
these as true, or we skip them,

9028
06:34:34,700 --> 06:34:35,800
what will happen

9029
06:34:35,800 --> 06:34:39,300
is that if your dataset has
the names of the columns

9030
06:34:39,300 --> 06:34:42,863
in the first row, it will take
those as data as well.

9031
06:34:42,863 --> 06:34:45,100
It will not infer
the schema. Now,

9032
06:34:45,100 --> 06:34:49,023
Once we have loaded the data
in our data frame we need to use

9033
06:34:49,023 --> 06:34:51,900
the show action to have
a look at the output.

9034
06:34:51,900 --> 06:34:53,223
So as you can see here,

9035
06:34:53,223 --> 06:34:55,399
we have the output,
and it

9036
06:34:55,399 --> 06:34:58,600
gives us the top 20 rows
of the particular dataset.

9037
06:34:58,600 --> 06:35:02,600
We have the year, month, day,
departure time, departure delay,

9038
06:35:02,600 --> 06:35:07,000
arrival time, arrival delay,
and so many more attributes.

9039
06:35:07,300 --> 06:35:08,500
To print the schema

9040
06:35:08,500 --> 06:35:11,500
of the particular DataFrame,
you need the transformation,

9041
06:35:11,500 --> 06:35:13,762
or rather say the action,
printSchema.

9042
06:35:13,762 --> 06:35:15,900
So let's have a look
at the schema.

9043
06:35:15,900 --> 06:35:19,117
As you can see here, we have year,
which is integer; month, integer.

9044
06:35:19,117 --> 06:35:21,000
Almost half of them are integer.

9045
06:35:21,000 --> 06:35:23,600
We have the carrier as
string, the tail number

9046
06:35:23,600 --> 06:35:26,625
as string, the origin
string, destination string,

9047
06:35:26,625 --> 06:35:28,123
and so on. Now suppose

9048
06:35:28,123 --> 06:35:29,075
I want to know

9049
06:35:29,075 --> 06:35:31,786
how many records are
there in my database,

9050
06:35:31,786 --> 06:35:33,685
or rather say the DataFrame.

9051
06:35:33,685 --> 06:35:36,600
You need the count
function for this one.

9052
06:35:36,600 --> 06:35:40,600
It will provide you with the results.
So as you can see here,

9053
06:35:40,600 --> 06:35:42,992
we have three point
three million records

9054
06:35:42,992 --> 06:35:44,097
here three million

9055
06:35:44,097 --> 06:35:46,800
thirty six thousand
seven hundred seventy six

9056
06:35:46,800 --> 06:35:48,400
to be exact. Now suppose

9057
06:35:48,400 --> 06:35:51,153
I want to have a look
at the flight name, the origin,

9058
06:35:51,153 --> 06:35:52,400
and the destination

9059
06:35:52,400 --> 06:35:55,400
of just these three columns
from the particular data frame.

9060
06:35:55,400 --> 06:35:57,800
We need to use
the select option.

9061
06:35:58,200 --> 06:36:00,882
So as you can see here,
we have the top 20 rows.

9062
06:36:00,882 --> 06:36:03,128
Now, what we saw
was the select query

9063
06:36:03,128 --> 06:36:05,000
on this particular data frame,

9064
06:36:05,000 --> 06:36:07,240
but if I wanted
to see or rather,

9065
06:36:07,240 --> 06:36:09,200
I want to check the summary

9066
06:36:09,200 --> 06:36:11,400
of any particular
column, suppose

9067
06:36:11,400 --> 06:36:14,500
I want to check
what is the lowest count

9068
06:36:14,500 --> 06:36:18,100
or the highest count in
the particular distance column.

9069
06:36:18,100 --> 06:36:20,500
I need to use
the describe function here.

9070
06:36:20,500 --> 06:36:23,100
So I'll show you what
the summary looks like.

9071
06:36:23,500 --> 06:36:25,142
So for distance, the count

9072
06:36:25,142 --> 06:36:27,900
is the number of rows,
the total number of rows.

9073
06:36:27,900 --> 06:36:30,800
We have the mean, the standard
deviation, the minimum value,

9074
06:36:30,800 --> 06:36:32,900
which is 17
and the maximum value,

9075
06:36:32,900 --> 06:36:34,500
which is 4983.

9076
06:36:34,900 --> 06:36:38,100
Now this gives you a summary
of the particular column

9077
06:36:38,100 --> 06:36:39,856
if you want to. So
now that we know

9078
06:36:39,856 --> 06:36:41,838
that the minimum distance is 17,

9079
06:36:41,838 --> 06:36:44,500
let's go ahead and filter
out our data using

9080
06:36:44,500 --> 06:36:47,700
the filter function
in which the distance is 17.

9081
06:36:48,700 --> 06:36:49,978
So you can see here.

9082
06:36:49,978 --> 06:36:51,000
We have one record

9083
06:36:51,000 --> 06:36:55,700
in which, in the year 2013,
the minimum distance here is 17.

9084
06:36:55,700 --> 06:36:59,100
But similarly, suppose I want
to have a look at the flights

9085
06:36:59,100 --> 06:37:01,600
which are originating from EWR.

9086
06:37:01,900 --> 06:37:02,400
Similarly.

9087
06:37:02,400 --> 06:37:04,600
We use the filter
function here as well.

9088
06:37:04,600 --> 06:37:06,599
Now, another clause here,

9089
06:37:06,599 --> 06:37:09,300
which is the where
clause, is also used

9090
06:37:09,300 --> 06:37:11,236
for filtering. Suppose

9091
06:37:11,236 --> 06:37:12,800
I want to have a look

9092
06:37:12,815 --> 06:37:16,046
at the flight data
and filter it out to see

9093
06:37:16,046 --> 06:37:17,507
if the day on

9094
06:37:17,507 --> 06:37:22,000
which the flight took off was
the second of any month, suppose.

9095
06:37:22,000 --> 06:37:23,589
So here, instead of filter,

9096
06:37:23,589 --> 06:37:25,422
we can also use a where clause,

9097
06:37:25,422 --> 06:37:27,500
which will give us
the same output.

9098
06:37:29,200 --> 06:37:33,100
Now, we can also pass
multiple parameters,

9099
06:37:33,100 --> 06:37:36,000
or rather say
multiple conditions.

9100
06:37:36,000 --> 06:37:39,866
So suppose I want the day
of the flight to be the seventh,

9101
06:37:39,866 --> 06:37:41,839
and the origin should be JFK,

9102
06:37:41,839 --> 06:37:45,292
and the arrival delay
should be less than 0, I mean,

9103
06:37:45,292 --> 06:37:47,900
that is, for none
of the postponed flights.

9104
06:37:48,000 --> 06:37:49,600
So just to have a look

9105
06:37:49,600 --> 06:37:52,314
at these numbers,
we'll use the where clause

9106
06:37:52,314 --> 06:37:55,600
and separate all the conditions
using the & symbol.

9107
06:37:56,100 --> 06:37:57,800
So you can see
here in all the data,

9108
06:37:57,800 --> 06:38:00,700
the day is 7, the origin is JFK,

9109
06:38:01,100 --> 06:38:04,900
and the arrival delay
is less than 0. Now,

9110
06:38:04,900 --> 06:38:07,621
These were the basic
Transformations and actions

9111
06:38:07,621 --> 06:38:09,300
on the particular data frame.

9112
06:38:09,300 --> 06:38:12,900
Now one thing we can also do
is create a temporary table

9113
06:38:12,900 --> 06:38:14,100
for SQL queries

9114
06:38:14,100 --> 06:38:15,100
if someone is

9115
06:38:15,100 --> 06:38:19,000
not good at or not used
to all these transformations

9116
06:38:19,000 --> 06:38:22,400
and actions and would rather
use SQL queries on the data,

9117
06:38:22,400 --> 06:38:26,006
they can use registerTempTable
to create a table

9118
06:38:26,006 --> 06:38:27,925
for their particular DataFrame.

9119
06:38:27,925 --> 06:38:30,129
What we'll do is
convert the nyc_flights_df

9120
06:38:30,129 --> 06:38:33,600
DataFrame
into the nyc_flights table,

9121
06:38:33,600 --> 06:38:36,700
which can be used later
and SQL queries can be performed

9122
06:38:36,700 --> 06:38:38,500
on this particular table.

9123
06:38:38,600 --> 06:38:43,000
So you remember in the beginning
we used nyc_flights_df

9124
06:38:43,000 --> 06:38:47,600
dot show. Now we can use
select asterisk from

9125
06:38:47,600 --> 06:38:51,600
nyc_flights to get
the same output. Now suppose

9126
06:38:51,600 --> 06:38:55,011
we want to look at the minimum
air time of any flight.

9127
06:38:55,011 --> 06:38:58,217
We use the select minimum
air time from NYC flights.

9128
06:38:58,217 --> 06:38:59,600
That is the SQL query.

9129
06:38:59,600 --> 06:39:02,400
We pass the whole SQL query
to the SQL context's

9130
06:39:02,400 --> 06:39:03,700
sql function.

9131
06:39:03,700 --> 06:39:04,800
So you can see here,

9132
06:39:04,800 --> 06:39:07,900
we have the minimum air time
as 20. Now, to have a look

9133
06:39:07,900 --> 06:39:11,400
at the records in which
the air time is the minimum, 20,

9134
06:39:11,600 --> 06:39:14,693
we can also use
nested SQL queries. Suppose

9135
06:39:14,693 --> 06:39:15,847
I want to check

9136
06:39:15,847 --> 06:39:19,328
which flights have
the minimum air time of 20. Now,

9137
06:39:19,328 --> 06:39:20,553
that cannot be done

9138
06:39:20,553 --> 06:39:24,132
in a simple SQL query; we need
a nested query for that one.

9139
06:39:24,132 --> 06:39:26,800
So select asterisk
from nyc_flights

9140
06:39:26,800 --> 06:39:29,500
where the air time
is in, and inside

9141
06:39:29,500 --> 06:39:30,913
that we have another query,

9142
06:39:30,913 --> 06:39:33,477
which is Select minimum air time
from NYC flights.

9143
06:39:33,477 --> 06:39:35,100
Let's see if this works or not.

9144
06:39:37,200 --> 06:39:38,497
Yes, as you can see here,

9145
06:39:38,497 --> 06:39:41,600
we have two flights which have
the minimum air time as 20.
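
As a rough illustration of the nested query described above, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for Spark SQL. The column names and the sample rows are made up for the example; in the session the same kind of query text is passed to the SQL function of the SQL context instead.

```python
import sqlite3

# In-memory database standing in for the registered temp table.
# Table layout and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nyc_flights (flight TEXT, air_time INTEGER)")
conn.executemany(
    "INSERT INTO nyc_flights VALUES (?, ?)",
    [("A1", 20), ("B2", 45), ("C3", 20), ("D4", 130)],
)

# Simple aggregate: the minimum air time over all flights.
min_air_time = conn.execute(
    "SELECT MIN(air_time) FROM nyc_flights"
).fetchone()[0]

# Nested query: every flight whose air time equals that minimum.
shortest_flights = conn.execute(
    "SELECT * FROM nyc_flights "
    "WHERE air_time IN (SELECT MIN(air_time) FROM nyc_flights) "
    "ORDER BY flight"
).fetchall()

print(min_air_time)
print(shortest_flights)
```

The inner SELECT is evaluated first and yields a single value, so the outer query only has to filter on it.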

9146
06:39:42,200 --> 06:39:44,400
So guys this is it
for data frames.

9147
06:39:44,400 --> 06:39:46,147
So let's get back
to our presentation

9148
06:39:46,147 --> 06:39:48,697
and have a look at the list
which we were following.

9149
06:39:48,697 --> 06:39:49,966
We completed data frames.

9150
06:39:49,966 --> 06:39:52,600
Next we have storage levels.
Now StorageLevel

9151
06:39:52,600 --> 06:39:55,200
in PySpark is a class
which helps in deciding

9152
06:39:55,200 --> 06:39:56,991
how the RDDs should be stored.

9153
06:39:56,991 --> 06:39:59,400
Now based on this, RDDs
are either stored

9154
06:39:59,400 --> 06:40:01,400
on disk or in memory or in

9155
06:40:01,400 --> 06:40:04,300
both. The class StorageLevel
also decides

9156
06:40:04,300 --> 06:40:06,594
whether the RDDs
should be serialized

9157
06:40:06,594 --> 06:40:09,480
or have their partitions replicated.
Now for the final

9158
06:40:09,480 --> 06:40:12,000
and last topic
on today's list,

9159
06:40:12,000 --> 06:40:15,100
which is MLlib. MLlib is
the machine learning API

9160
06:40:15,100 --> 06:40:17,000
which is provided by spark,

9161
06:40:17,000 --> 06:40:18,600
which is also present in Python.

9162
06:40:18,700 --> 06:40:21,180
And this library
is heavily used in Python

9163
06:40:21,180 --> 06:40:22,597
for machine learning as

9164
06:40:22,597 --> 06:40:26,094
well as real-time streaming
analytics. The various algorithms

9165
06:40:26,094 --> 06:40:28,773
supported by this library
are, first of all,

9166
06:40:28,773 --> 06:40:30,600
we have spark.mllib.

9167
06:40:30,600 --> 06:40:33,482
Now, currently PySpark
MLlib supports model-

9168
06:40:33,482 --> 06:40:37,500
based collaborative filtering,
with a small set of latent factors;

9169
06:40:37,500 --> 06:40:40,500
here all the users
and the products are described by them,

9170
06:40:40,500 --> 06:40:42,300
which we can use
to predict the

9171
06:40:42,300 --> 06:40:45,909
missing entries. However,
to learn these latent factors

9172
06:40:45,909 --> 06:40:48,886
spark.mllib uses
the alternating least squares,

9173
06:40:48,886 --> 06:40:50,755
which is the ALS algorithm.

9174
06:40:50,755 --> 06:40:52,900
Next we have MLlib clustering.

9175
06:40:52,900 --> 06:40:53,852
An unsupervised

9176
06:40:53,852 --> 06:40:57,700
learning problem is clustering.
Now here we try to group subsets

9177
06:40:57,700 --> 06:40:59,989
of entities with one
another on the basis

9178
06:40:59,989 --> 06:41:02,000
of some notion of similarity.

9179
06:41:02,200 --> 06:41:02,500
Next.

9180
06:41:02,500 --> 06:41:04,500
We have frequent
pattern mining,

9181
06:41:04,500 --> 06:41:08,400
which is FPM. Now frequent
pattern mining is mining

9182
06:41:08,400 --> 06:41:12,800
frequent items, itemsets,
subsequences or other substructures

9183
06:41:12,800 --> 06:41:13,600
that are usually

9184
06:41:13,600 --> 06:41:16,900
among the first steps to analyze
a large-scale data set.

9185
06:41:16,900 --> 06:41:20,600
This has been an active research
topic in data mining for years.

9186
06:41:20,600 --> 06:41:22,800
Next we have linear algebra.

9187
06:41:23,000 --> 06:41:25,032
Now this module
provides PySpark

9188
06:41:25,032 --> 06:41:27,403
MLlib utilities
for linear algebra.

9189
06:41:27,403 --> 06:41:29,300
We have collaborative filtering.

9190
06:41:29,400 --> 06:41:30,900
We have classification

9191
06:41:30,900 --> 06:41:34,000
For binary classification,
various methods are available

9192
06:41:34,000 --> 06:41:37,700
in the spark.mllib package, as for
multi-class classification as

9193
06:41:37,700 --> 06:41:40,912
well as regression analysis.
In classification, some

9194
06:41:40,912 --> 06:41:44,067
of the most popular algorithms
used are Naive Bayes, random

9195
06:41:44,067 --> 06:41:45,457
forest, decision tree

9196
06:41:45,457 --> 06:41:48,600
and so on. And finally we
have linear regression.

9197
06:41:48,600 --> 06:41:51,300
Now basically linear regression
comes from the family

9198
06:41:51,300 --> 06:41:54,064
of regression algorithms.
To find relationships

9199
06:41:54,064 --> 06:41:56,812
and dependencies between
variables is the main goal

9200
06:41:56,812 --> 06:41:58,594
of regression. Also, the PySpark

9201
06:41:58,594 --> 06:42:01,400
MLlib package also covers
other algorithm classes

9202
06:42:01,400 --> 06:42:02,100
and functions.

9203
06:42:02,400 --> 06:42:04,591
Let's now try to implement
all the concepts

9204
06:42:04,591 --> 06:42:07,200
which we have learned
in this PySpark tutorial session.

9205
06:42:07,200 --> 06:42:10,600
now here we are going to use
a heart disease prediction model

9206
06:42:10,600 --> 06:42:13,278
and we are going to predict
using the decision tree

9207
06:42:13,278 --> 06:42:16,599
with the help of classification
as well as regression.

9208
06:42:16,599 --> 06:42:16,800
Now.

9209
06:42:16,800 --> 06:42:19,600
These are all part
of the MLlib library.

9210
06:42:19,600 --> 06:42:21,800
Let's see how we
can perform these types

9211
06:42:21,800 --> 06:42:23,300
of functions and queries.

9212
06:42:39,800 --> 06:42:40,600
So first of all,

9213
06:42:40,600 --> 06:42:43,700
what we need to do
is initialize the spark context.

9214
06:42:45,100 --> 06:42:48,300
Next we are going
to read the UCI data set

9215
06:42:48,400 --> 06:42:50,500
of the heart disease prediction

9216
06:42:50,600 --> 06:42:52,600
and we are going
to clean the data.

9217
06:42:52,600 --> 06:42:55,700
So let's import the pandas
and the numpy library here.

9218
06:42:56,000 --> 06:42:58,852
Let's create a data frame
as heart disease df, and

9219
06:42:58,852 --> 06:43:00,100
as mentioned earlier,

9220
06:43:00,100 --> 06:43:03,544
we are going to use
the read_csv method here,

9221
06:43:03,700 --> 06:43:05,300
and here we don't have a header.

9222
06:43:05,300 --> 06:43:07,500
So we have provided
header as None.

9223
06:43:07,700 --> 06:43:10,800
Now the original data set
contains 303 rows

9224
06:43:10,800 --> 06:43:12,100
and 14 columns.

9225
06:43:12,600 --> 06:43:15,800
Now the categories
of diagnosis of heart disease

9226
06:43:15,900 --> 06:43:17,000
that we are predicting:

9227
06:43:17,300 --> 06:43:22,400
the value 0 is for less than 50%
diameter narrowing, and the value

9228
06:43:22,400 --> 06:43:24,900
1, which we are giving,
is for the cases

9229
06:43:24,900 --> 06:43:27,500
which have more than 50%
diameter narrowing.

9230
06:43:28,700 --> 06:43:31,623
So here we are using
the numpy library.

9231
06:43:32,700 --> 06:43:35,921
These are particularly
old methods which are showing

9232
06:43:35,921 --> 06:43:39,400
a deprecation warning,
but no issue, it will work fine.

9233
06:43:40,900 --> 06:43:42,500
So as you can see here,

9234
06:43:42,500 --> 06:43:45,300
we have the categories
of diagnosis of heart disease

9235
06:43:45,300 --> 06:43:48,100
that we are predicting:
the value 0 is for less than 50

9236
06:43:48,100 --> 06:43:50,000
and value 1 is greater than 50.

9237
06:43:50,400 --> 06:43:53,014
So what we did here
was clean the rows

9238
06:43:53,014 --> 06:43:57,500
which have the question mark
or which have the empty spaces.

9239
06:43:58,700 --> 06:44:00,900
Now let's take a look
at the data set here.

9240
06:44:00,900 --> 06:44:02,200
Now, you can see here.

9241
06:44:02,200 --> 06:44:06,086
We have zero at many places
instead of the question mark

9242
06:44:06,086 --> 06:44:07,500
which we had earlier

9243
06:44:08,600 --> 06:44:11,300
and now we are saving
it to a txt file.

9244
06:44:12,000 --> 06:44:14,200
And you can see here,
after dropping the rows

9245
06:44:14,200 --> 06:44:15,494
with any empty values.

9246
06:44:15,494 --> 06:44:18,000
we have 297 rows
and 14 columns.
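
The cleaning step can be sketched roughly as follows. The session uses pandas; this stand-in uses only the standard csv module, and the sample rows and the helper name drop_incomplete_rows are hypothetical.

```python
import csv
import io

# Hypothetical sample in a headerless, comma-separated layout;
# "?" marks a missing value.
raw = io.StringIO(
    "63,1,145,0\n"
    "67,1,?,1\n"
    "41,0,130,0\n"
    "56,?,120,1\n"
)

def drop_incomplete_rows(lines):
    """Keep only the rows in which no field is missing ("?" or empty)."""
    kept = []
    for row in csv.reader(lines):
        if all(field.strip() not in ("", "?") for field in row):
            kept.append(row)
    return kept

clean_rows = drop_incomplete_rows(raw)
print(len(clean_rows), "rows kept")
```

Dropping rows this way is what takes a data set from its original row count down to the rows that are complete.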

9247
06:44:18,300 --> 06:44:20,800
So this is what the new
clean data set looks

9248
06:44:20,800 --> 06:44:24,400
like. Now we are importing
the MLlib library

9249
06:44:24,400 --> 06:44:26,500
and the regression here now here

9250
06:44:26,500 --> 06:44:29,077
what we are going to do
is create a LabeledPoint,

9251
06:44:29,077 --> 06:44:31,900
which is a local Vector
associated with a label

9252
06:44:31,900 --> 06:44:33,100
or a response.

9253
06:44:33,100 --> 06:44:36,600
So for that we need to import
the mllib.regression module.

9254
06:44:37,800 --> 06:44:39,600
So for that we are
taking the text file

9255
06:44:39,600 --> 06:44:43,000
which we just created now
without the missing values.

9256
06:44:43,000 --> 06:44:43,665
Now next.

9257
06:44:43,665 --> 06:44:47,678
What we are going to do is
parse the MLlib data line by line

9258
06:44:47,678 --> 06:44:49,900
into MLlib LabeledPoint objects,

9259
06:44:49,900 --> 06:44:51,671
and we are going
to convert the minus

9260
06:44:51,671 --> 06:44:53,000
one labels to 0. Now,

9261
06:44:53,000 --> 06:44:56,200
let's have a look, after parsing,
at the number of training lines.

9262
06:44:57,800 --> 06:45:00,200
Okay, we have the labels 0 and 1.

9263
06:45:00,600 --> 06:45:01,700
That's cool.

9264
06:45:01,700 --> 06:45:04,700
Now next what we are going to do
is perform classification using

9265
06:45:04,700 --> 06:45:05,800
the decision tree.

9266
06:45:05,800 --> 06:45:09,300
So for that we need to import
the pyspark.mllib.tree module.

9267
06:45:09,600 --> 06:45:13,200
So next what we have to do is
split the data into the training

9268
06:45:13,200 --> 06:45:14,300
and testing data

9269
06:45:14,300 --> 06:45:18,500
and we split the data here
into the standard 70 to 30 ratio,

9270
06:45:18,600 --> 06:45:20,672
70% being the training data set

9271
06:45:20,672 --> 06:45:24,541
and the 30% being the testing
data set next what we do is

9272
06:45:24,541 --> 06:45:26,200
that we train the model.

9273
06:45:26,200 --> 06:45:28,600
which we have created here,
using the training set.
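
A minimal sketch of the 70 to 30 split, in plain Python rather than on an RDD. The helper name random_split, the seed, and the placeholder rows are assumptions for illustration; a Spark random split assigns rows at random, so its two parts only approximate the ratio, whereas this sketch cuts a shuffled list at an exact position.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(297))  # placeholder for the 297 cleaned rows
train, test = random_split(rows)
print(len(train), len(test))
```

The model is then fitted on the training part only, and the held-out part is kept for evaluation.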

9274
06:45:29,100 --> 06:45:31,100
We have created
a trained model with DecisionTree

9275
06:45:31,100 --> 06:45:32,400
dot trainClassifier.

9276
06:45:32,400 --> 06:45:34,400
We have used
the training data; the number

9277
06:45:34,400 --> 06:45:36,947
of classes is five; the
categorical features,

9278
06:45:36,947 --> 06:45:38,104
which we have given

9279
06:45:38,104 --> 06:45:40,600
maximum depth to which
we are classifying.

9280
06:45:40,600 --> 06:45:42,000
It is 3. Next,

9281
06:45:42,000 --> 06:45:45,505
what we are going to do is
evaluate the model based

9282
06:45:45,505 --> 06:45:49,000
on the test data set
and compute the error.

9283
06:45:49,300 --> 06:45:50,800
So here we are creating

9284
06:45:50,800 --> 06:45:53,211
predictions and we
are using the test data

9285
06:45:53,211 --> 06:45:55,800
to get the predictions
through the model

9286
06:45:55,800 --> 06:45:58,200
which we trained, and we
are also going to find

9287
06:45:58,200 --> 06:45:59,500
the test errors here.
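
For a classifier, the test error is simply the share of test rows whose prediction differs from the true label. A small sketch with hypothetical label and prediction pairs (not the session's data):

```python
def classification_error(labels_and_predictions):
    """Fraction of (label, prediction) pairs that disagree."""
    pairs = list(labels_and_predictions)
    wrong = sum(1 for label, pred in pairs if label != pred)
    return wrong / len(pairs)

# Hypothetical (label, prediction) pairs.
pairs = [(0, 0), (1, 1), (1, 0), (0, 0)]
error = classification_error(pairs)
print(error)
```

One of the four pairs disagrees here, so a quarter of the predictions are wrong.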

9288
06:45:59,700 --> 06:46:00,900
So as you can see here,

9289
06:46:00,900 --> 06:46:04,507
the test error is
0.2297. We

9290
06:46:04,507 --> 06:46:08,200
have created a classification
decision tree model

9291
06:46:08,200 --> 06:46:11,100
in which feature 12
is tested against 3, and the value

9292
06:46:11,100 --> 06:46:13,225
of feature
0 against 54.

9293
06:46:13,225 --> 06:46:16,014
So as you can see
our model is pretty good.

9294
06:46:16,014 --> 06:46:19,700
So now next we'll use regression
for the same purposes.

9295
06:46:19,700 --> 06:46:22,300
So let's perform the regression
using decision tree.

9296
06:46:22,500 --> 06:46:24,500
So as you can see
we have the train model

9297
06:46:24,500 --> 06:46:26,400
and we are using
the DecisionTree dot

9298
06:46:26,400 --> 06:46:29,460
trainRegressor, using
the training data, the same

9299
06:46:29,460 --> 06:46:33,200
which we created using the
decision tree model up there.

9300
06:46:33,200 --> 06:46:34,811
There we used classification;

9301
06:46:34,811 --> 06:46:37,440
now we are using
regression. Now similarly,

9302
06:46:37,440 --> 06:46:38,921
We are going to evaluate

9303
06:46:38,921 --> 06:46:42,500
our model using our test data
set and find the test error,

9304
06:46:42,500 --> 06:46:45,600
which is the mean squared error
here for regression.

9305
06:46:45,600 --> 06:46:48,200
So let's have a look
at the mean square error here.

9306
06:46:48,200 --> 06:46:50,584
The mean square error is 0.168.
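
For regression, the error measure mentioned here is the mean of the squared differences between label and prediction. A sketch with hypothetical pairs (not the session's data):

```python
def mean_squared_error(labels_and_predictions):
    """Average of (label - prediction) squared over all pairs."""
    pairs = list(labels_and_predictions)
    return sum((label - pred) ** 2 for label, pred in pairs) / len(pairs)

# Hypothetical (label, prediction) pairs.
pairs = [(1.0, 0.5), (0.0, 0.5), (1.0, 1.0), (0.0, 0.0)]
mse = mean_squared_error(pairs)
print(mse)
```

Unlike the classification error, this penalizes a prediction in proportion to how far off it is, not just whether it is wrong.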

9307
06:46:50,800 --> 06:46:52,100
That is good.

9308
06:46:52,100 --> 06:46:53,318
Now finally if we have

9309
06:46:53,318 --> 06:46:55,700
a look at the Learned
regression tree model.

9310
06:46:56,800 --> 06:47:00,300
You can see we have created
the regression tree model

9311
06:47:00,300 --> 06:47:02,800
till the depth
of 3 with 15 nodes.

9312
06:47:02,800 --> 06:47:04,577
And here we have
all the features

9313
06:47:04,577 --> 06:47:06,300
and classification of the tree.

9314
06:47:11,000 --> 06:47:11,675
Hello folks.

9315
06:47:11,675 --> 06:47:13,700
Welcome to Spark
interview questions.

9316
06:47:13,800 --> 06:47:16,949
The session has been planned
collectively to have commonly

9317
06:47:16,949 --> 06:47:19,988
asked interview questions related
to the Spark technology

9318
06:47:19,988 --> 06:47:22,400
and the general answer
and the expectation

9319
06:47:22,400 --> 06:47:25,594
is that you are already aware
of this particular technology

9320
06:47:25,594 --> 06:47:29,200
to some extent, and in general
the common questions being asked,

9321
06:47:29,200 --> 06:47:31,500
as well as I will give an
introduction to the technology

9322
06:47:31,500 --> 06:47:33,600
as well. So let's get started.

9323
06:47:33,600 --> 06:47:36,023
So the agenda for
this particular session is

9324
06:47:36,023 --> 06:47:38,197
the basic questions
we are going to cover,

9325
06:47:38,197 --> 06:47:41,138
and questions related
to the Spark Core technologies.

9326
06:47:41,138 --> 06:47:42,400
When I say Spark

9327
06:47:42,400 --> 06:47:44,900
Core, that's going to be
the base, and on top

9328
06:47:44,900 --> 06:47:48,075
of Spark Core we have
four important components

9329
06:47:48,075 --> 06:47:50,669
which work on it, that
is Streaming, GraphX,

9330
06:47:50,669 --> 06:47:53,100
MLlib and SQL.
All these components

9331
06:47:53,100 --> 06:47:57,500
have been created to satisfy a
requirement. We'll again give an introduction

9332
06:47:57,500 --> 06:47:59,495
to these technologies and get

9333
06:47:59,495 --> 06:48:02,200
into the commonly
asked interview questions

9334
06:48:02,300 --> 06:48:04,500
and the questions are also
framed in such a way that

9335
06:48:04,500 --> 06:48:07,200
they cover the spectrum
of the doubts as well

9336
06:48:07,200 --> 06:48:10,600
as the features available
within that specific technology.

9337
06:48:10,600 --> 06:48:12,512
So let's take the first question

9338
06:48:12,512 --> 06:48:15,800
and look into the answer and
how commonly this is covered.

9339
06:48:15,800 --> 06:48:19,800
What is Apache Spark? Spark
is with the Apache Foundation now;

9340
06:48:20,000 --> 06:48:21,000
it's open source.

9341
06:48:21,000 --> 06:48:22,809
It's a cluster
Computing framework

9342
06:48:22,809 --> 06:48:24,280
for real-time processing.

9343
06:48:24,280 --> 06:48:25,750
So three main keywords over

9344
06:48:25,750 --> 06:48:28,151
here: Apache Spark is
an open source project.

9345
06:48:28,151 --> 06:48:29,856
It's used for cluster Computing.

9346
06:48:29,856 --> 06:48:33,272
And for in-memory processing
along with real-time processing.

9347
06:48:33,272 --> 06:48:35,485
It's going to support
in memory Computing.

9348
06:48:35,485 --> 06:48:36,672
So there are lots of projects

9349
06:48:36,672 --> 06:48:38,400
which support cluster computing;

9350
06:48:38,400 --> 06:48:42,100
along with that, Spark
differentiates itself by doing

9351
06:48:42,100 --> 06:48:43,839
the in-memory Computing.

9352
06:48:43,839 --> 06:48:46,231
It has a very active
community, and out

9353
06:48:46,231 --> 06:48:50,000
of the Hadoop ecosystem
technologies, Apache Spark is

9354
06:48:50,000 --> 06:48:51,500
very active; multiple releases

9355
06:48:51,500 --> 06:48:52,800
we got last year.

9356
06:48:52,800 --> 06:48:56,750
It's a very active project
among the Apache projects. Basically,

9357
06:48:56,750 --> 06:49:00,072
it's a framework that can support
in-memory computing

9358
06:49:00,072 --> 06:49:04,100
and cluster Computing and you
may face this specific question

9359
06:49:04,100 --> 06:49:05,700
how spark is different

9360
06:49:05,700 --> 06:49:08,085
than MapReduce, or
how you can compare it

9361
06:49:08,085 --> 06:49:11,400
with MapReduce. MapReduce
is the processing methodology

9362
06:49:11,400 --> 06:49:12,900
within the Hadoop ecosystem

9363
06:49:12,900 --> 06:49:14,400
and within Hadoop ecosystem.

9364
06:49:14,400 --> 06:49:18,700
We have HDFS, the Hadoop distributed
file system; MapReduce is going

9365
06:49:18,700 --> 06:49:23,300
to support distributed computing
and how spark is different.

9366
06:49:23,300 --> 06:49:25,900
So how can we compare
Spark with

9367
06:49:25,900 --> 06:49:28,907
MapReduce? In a way,
this comparison is going

9368
06:49:28,907 --> 06:49:32,400
to help us to understand
the technology better.

9369
06:49:32,400 --> 06:49:33,100
But definitely

9370
06:49:33,100 --> 06:49:36,600
we cannot compare these two;
they are two different methodologies

9371
06:49:36,600 --> 06:49:40,200
by which they work.
Spark is very simple to program,

9372
06:49:40,200 --> 06:49:42,700
but in MapReduce there
is no abstraction,

9373
06:49:42,700 --> 06:49:44,118
in the sense that all

9374
06:49:44,118 --> 06:49:47,900
the implementations we have
to provide. On interactivity,

9375
06:49:47,900 --> 06:49:52,200
there is an interactive mode to
work with in Spark. In MapReduce

9376
06:49:52,200 --> 06:49:53,800
there is no interactive mode.

9377
06:49:53,800 --> 06:49:55,900
There are some
components like Apache

9378
06:49:55,900 --> 06:49:56,800
Pig and Hive

9379
06:49:56,800 --> 06:50:00,400
which facilitate us to do
the interactive computing

9380
06:50:00,400 --> 06:50:02,145
or interactive programming

9381
06:50:02,145 --> 06:50:05,100
and Spark supports
real-time stream processing,

9382
06:50:05,100 --> 06:50:07,700
and to say it precisely,
within Spark

9383
06:50:07,700 --> 06:50:11,000
the stream processing is called
a near real-time processing.

9384
06:50:11,000 --> 06:50:13,600
Nothing in the world
is truly real-time processing.

9385
06:50:13,600 --> 06:50:15,100
It's near real-time processing.

9386
06:50:15,100 --> 06:50:18,200
It's going to do the processing
in micro batches.
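
The micro-batch idea can be sketched in plain Python as follows. This is a simplification: the helper name micro_batches is made up, and it groups records by count, whereas Spark Streaming groups them by a time interval.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size records from stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical incoming events; each small batch is processed as a unit.
events = range(1, 8)
batch_sums = [sum(batch) for batch in micro_batches(events, batch_size=3)]
print(batch_sums)
```

Each batch is handled like a small batch job, which is why the result is near real time rather than strictly real time.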

9387
06:50:18,200 --> 06:50:19,200
I'll cover in detail

9388
06:50:19,200 --> 06:50:21,400
when we are moving
onto the streaming concept

9389
06:50:21,400 --> 06:50:22,600
and you're going to do

9390
06:50:22,600 --> 06:50:25,700
the batch processing on
the historical data in MapReduce.

9391
06:50:25,700 --> 06:50:28,300
When I say stream
processing, I will get the data

9392
06:50:28,300 --> 06:50:31,025
that is getting processed
in real time and do

9393
06:50:31,025 --> 06:50:33,849
the processing and get
the result, and either store it

9394
06:50:33,849 --> 06:50:35,772
or publish it to the publish community.

9395
06:50:35,772 --> 06:50:37,697
We will be doing that. Latency

9396
06:50:37,697 --> 06:50:40,149
wise, MapReduce will have
very high latency

9397
06:50:40,149 --> 06:50:42,915
because it has to read
the data from hard disk,

9398
06:50:42,915 --> 06:50:45,200
but Spark will have
very low latency

9399
06:50:45,200 --> 06:50:47,200
because it can reprocess

9400
06:50:47,200 --> 06:50:50,500
or reuse the data
already cached in memory,

9401
06:50:50,500 --> 06:50:53,786
but there is a small catch
over here in spark first time

9402
06:50:53,786 --> 06:50:56,600
when the data gets loaded, it
has to read it

9403
06:50:56,600 --> 06:50:59,100
from the hard disk
same as mapreduce.

9404
06:50:59,100 --> 06:51:01,600
So once it is read, it
will be there in the memory.

9405
06:51:01,692 --> 06:51:03,000
So Spark is good

9406
06:51:03,000 --> 06:51:05,100
whenever we need to do iterative

9407
06:51:05,100 --> 06:51:08,900
computing. So with Spark, whenever
you do iterative computing, again

9408
06:51:08,900 --> 06:51:11,400
and again doing the processing
on the same data,

9409
06:51:11,400 --> 06:51:14,200
especially in machine learning
and deep learning, where we will be

9410
06:51:14,200 --> 06:51:17,900
using iterative computing,
Spark performs much better.

9411
06:51:17,900 --> 06:51:19,805
You will see
a performance

9412
06:51:19,805 --> 06:51:22,651
improvement, a hundred times
faster than MapReduce.

9413
06:51:22,651 --> 06:51:25,800
But if it is one time processing
and fire-and-forget,

9414
06:51:25,800 --> 06:51:28,805
that type
of processing, then with Spark

9415
06:51:28,805 --> 06:51:30,600
maybe the same latency,

9416
06:51:30,600 --> 06:51:32,699
you will be getting
as with MapReduce, maybe

9417
06:51:32,699 --> 06:51:35,900
with some improvements because
of the building block of Spark.

9418
06:51:35,900 --> 06:51:38,800
That's the RDD; you may get
some additional advantage.

9419
06:51:38,800 --> 06:51:43,000
So that's the key feature, or
the key comparison factor,

9420
06:51:43,300 --> 06:51:45,200
of Spark and MapReduce.

9421
06:51:45,800 --> 06:51:50,100
Now, let's get on to the key
features. Explain the key features of Spark.

9422
06:51:50,200 --> 06:51:52,200
We discussed
the speed and performance.

9423
06:51:52,200 --> 06:51:54,200
It's going to use
the in-memory Computing

9424
06:51:54,200 --> 06:51:55,559
so speed and performance-

9425
06:51:55,559 --> 06:51:57,300
wise it's going to be much better

9426
06:51:57,300 --> 06:52:00,900
when we do iterative computing.
And polyglot, in the sense of

9427
06:52:00,900 --> 06:52:03,810
the programming language
to be used with a spark.

9428
06:52:03,810 --> 06:52:06,700
It can be any of these languages;
it can be Python,

9429
06:52:06,700 --> 06:52:08,400
Java, R or Scala.

9430
06:52:08,400 --> 06:52:08,570
Mm.

9431
06:52:08,570 --> 06:52:11,300
We can do programming
with any of these languages

9432
06:52:11,300 --> 06:52:14,200
and data formats
to give as input:

9433
06:52:14,200 --> 06:52:17,172
we can give any data formats
like JSON or Parquet;

9434
06:52:17,172 --> 06:52:18,900
those data formats we can

9435
06:52:18,900 --> 06:52:21,888
give as input.
And the key selling point

9436
06:52:21,888 --> 06:52:24,400
with Spark is its
lazy evaluation, in the

9437
06:52:24,400 --> 06:52:25,575
sense that it's going

9438
06:52:25,575 --> 06:52:29,100
to calculate the DAG,
the directed acyclic graph,

9439
06:52:29,100 --> 06:52:32,700
D-A-G. Because there is a DAG,
it's going to calculate

9440
06:52:32,700 --> 06:52:35,300
what all steps needs
to be executed to achieve

9441
06:52:35,300 --> 06:52:36,400
the final result.

9442
06:52:36,400 --> 06:52:38,969
So we need to give all
the steps as well as

9443
06:52:38,969 --> 06:52:40,519
what final result I want.

9444
06:52:40,519 --> 06:52:42,983
It's going to calculate
the optimal cycle

9445
06:52:42,983 --> 06:52:44,400
on optimal calculation.

9446
06:52:44,400 --> 06:52:46,400
what all steps need
to be calculated

9447
06:52:46,400 --> 06:52:49,100
or what all steps need
to be executed; only those steps

9448
06:52:49,100 --> 06:52:50,500
it will be executing.

9449
06:52:50,500 --> 06:52:52,900
So basically it's
a lazy execution only

9450
06:52:52,900 --> 06:52:54,450
if the results needs
to be processed,

9451
06:52:54,450 --> 06:52:55,800
it will be processing that.
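
As an analogy for lazy execution, and not Spark's actual mechanism, Python generators behave in a similar way: building the pipeline runs nothing, and asking for a result triggers only the work needed to produce it. The step names load and double are made up for this sketch.

```python
executed = []  # records which steps actually ran

def load(values):
    for v in values:
        executed.append(("load", v))
        yield v

def double(values):
    for v in values:
        executed.append(("double", v))
        yield v * 2

# Building the pipeline runs nothing: generators are lazy.
pipeline = double(load([1, 2, 3, 4]))
steps_before = len(executed)

# Asking for one result triggers only the steps needed for it.
first = next(pipeline)
print(steps_before, first, executed)
```

Spark goes further than this analogy: because it sees the whole graph of steps before running anything, it can also plan and optimize the execution.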

9452
06:52:55,800 --> 06:52:58,623
because of it. And it
supports real-time computing.

9453
06:52:58,623 --> 06:53:00,200
It's through spark streaming

9454
06:53:00,200 --> 06:53:02,200
there is a component
called spark streaming

9455
06:53:02,200 --> 06:53:04,700
which supports real-time
Computing and it gels

9456
06:53:04,700 --> 06:53:07,115
with the Hadoop ecosystem very well.

9457
06:53:07,115 --> 06:53:09,500
It can run on top of Hadoop YARN

9458
06:53:09,500 --> 06:53:12,562
or it can leverage HDFS
to do the processing.

9459
06:53:12,562 --> 06:53:16,300
So when it leverages the hdfs
the Hadoop cluster container

9460
06:53:16,300 --> 06:53:19,400
can be used to do
the distributed computing

9461
06:53:19,400 --> 06:53:23,707
as well as it can leverage
the resource manager to manage

9462
06:53:23,707 --> 06:53:25,400
the resources. So Spark

9463
06:53:25,400 --> 06:53:28,426
can gel with HDFS very
well as well as it can leverage

9464
06:53:28,426 --> 06:53:29,642
the resource manager

9465
06:53:29,642 --> 06:53:32,500
to share the resources
as well as data locality.

9466
06:53:32,500 --> 06:53:34,699
You can leverage data locality.

9467
06:53:34,699 --> 06:53:36,900
It can do the processing near

9468
06:53:36,900 --> 06:53:41,200
to where the data is located
within HDFS. And it has a fleet

9469
06:53:41,200 --> 06:53:43,700
of machine learning
algorithms already implemented

9470
06:53:43,700 --> 06:53:46,100
right from clustering
classification regression.

9471
06:53:46,100 --> 06:53:48,238
All this logic is
already implemented

9472
06:53:48,238 --> 06:53:49,600
and machine learning.

9473
06:53:49,600 --> 06:53:52,400
It's achieved using
MLlib within Spark,

9474
06:53:52,400 --> 06:53:54,800
and there is a component
called GraphX

9475
06:53:54,800 --> 06:53:58,600
which supports graphs. We
can solve problems using

9476
06:53:58,600 --> 06:54:02,600
graph theory using the component
GraphX within Spark.

9477
06:54:02,700 --> 06:54:04,700
So these are the things
we can consider as

9478
06:54:04,700 --> 06:54:06,700
the key features of spark.

9479
06:54:06,700 --> 06:54:09,400
So when you discuss
the installation

9480
06:54:09,400 --> 06:54:10,300
of Spark,

9481
06:54:10,300 --> 06:54:13,581
you may come across this: YARN.
What is YARN? Do you

9482
06:54:13,581 --> 06:54:16,765
need to install Spark
on all nodes of a YARN cluster?

9483
06:54:16,765 --> 06:54:19,700
So YARN is nothing
but Yet Another Resource Negotiator.

9484
06:54:19,700 --> 06:54:22,500
That's the resource manager
within the Hadoop ecosystem.

9485
06:54:22,500 --> 06:54:25,529
So that's going to provide the
resource management platform.

9486
06:54:25,529 --> 06:54:28,200
YARN is going to provide
the resource management platform

9487
06:54:28,200 --> 06:54:29,500
across all the clusters,

9488
06:54:29,600 --> 06:54:33,200
and Spark is going
to provide the data processing.

9489
06:54:33,200 --> 06:54:35,300
So wherever there is
a resource being used,

9490
06:54:35,300 --> 06:54:38,049
at that location Spark will be
used to do the data processing.

9491
06:54:38,049 --> 06:54:39,056
And of course, yes,

9492
06:54:39,056 --> 06:54:41,600
we need to have spark
installed on all the nodes.

9493
06:54:41,800 --> 06:54:43,900
where the Spark clusters are located.

9494
06:54:43,900 --> 06:54:47,100
That's basically because we need
those libraries. And in addition

9495
06:54:47,100 --> 06:54:50,200
to the installation of Spark
on all the worker nodes,

9496
06:54:50,200 --> 06:54:52,106
we need to increase
the RAM capacity

9497
06:54:52,106 --> 06:54:53,283
on the worker machines

9498
06:54:53,283 --> 06:54:55,800
as well, as Spark is going
to consume huge amounts of

9499
06:54:56,100 --> 06:55:00,500
memory to do the processing. It
will not do the MapReduce way

9500
06:55:00,500 --> 06:55:01,600
of working internally.

9501
06:55:01,600 --> 06:55:04,191
It's going to generate
the DAG cycle and do

9502
06:55:04,191 --> 06:55:06,000
the processing on top of YARN.

9503
06:55:06,000 --> 06:55:09,900
So YARN, at a high level, is
like a resource manager

9504
06:55:09,900 --> 06:55:13,100
or like an operating system
for the distributed computing.

9505
06:55:13,100 --> 06:55:15,500
It's going to coordinate
all the resource management

9506
06:55:15,500 --> 06:55:17,900
across the fleet
of servers on top of it.

9507
06:55:17,900 --> 06:55:20,100
I can have multiple components

9508
06:55:20,100 --> 06:55:25,100
like Spark, Tez, Giraph.
Spark especially is going

9509
06:55:25,100 --> 06:55:27,800
to help us achieve
in-memory computing.

9510
06:55:27,800 --> 06:55:30,900
So for Spark, YARN is nothing
but a resource manager

9511
06:55:30,900 --> 06:55:33,600
to manage the resource
across the cluster on top of it.

9512
06:55:33,600 --> 06:55:35,470
We can have Spark, and yes,

9513
06:55:35,470 --> 06:55:37,700
we need to have spark installed

9514
06:55:37,700 --> 06:55:41,800
on all the nodes where
the Spark YARN cluster is used,

9515
06:55:41,800 --> 06:55:43,581
and also additional to that.

9516
06:55:43,581 --> 06:55:45,809
We need to have
the memory increased

9517
06:55:45,809 --> 06:55:47,400
in all the worker nodes.

9518
06:55:47,600 --> 06:55:48,870
The next question goes

9519
06:55:48,870 --> 06:55:51,400
like this: what file
systems does Spark support?

9520
06:55:52,300 --> 06:55:55,779
What is a file system? When
we work on an individual system,

9521
06:55:55,779 --> 06:55:58,100
We will be having
a file system to work

9522
06:55:58,100 --> 06:56:01,000
within that particular
operating system, whereas

9523
06:56:01,000 --> 06:56:04,900
in a distributed cluster or in
a distributed architecture

9524
06:56:04,900 --> 06:56:06,744
we need a file system with which

9525
06:56:06,744 --> 06:56:09,800
we can store the data
in a distributed mechanism.

9526
06:56:09,800 --> 06:56:12,900
Hadoop comes with
a file system called HDFS.

9527
06:56:13,100 --> 06:56:15,800
It's called Hadoop
distributed file system

9528
06:56:15,800 --> 06:56:19,131
whereby data gets distributed
across multiple systems

9529
06:56:19,131 --> 06:56:21,400
and it will be coordinated by two

9530
06:56:21,400 --> 06:56:24,500
different types of components
called NameNode and DataNode,

9531
06:56:24,500 --> 06:56:27,800
and Spark can use
HDFS directly.

9532
06:56:27,800 --> 06:56:30,900
So you can have any files
in hdfs and start using it

9533
06:56:30,900 --> 06:56:34,800
within the spark ecosystem
and it gives another advantage

9534
06:56:34,800 --> 06:56:35,900
of data locality

9535
06:56:35,900 --> 06:56:38,415
when it does the distributed
processing wherever

9536
06:56:38,415 --> 06:56:39,700
the data is distributed.

9537
06:56:39,700 --> 06:56:42,400
The processing could be done
locally to that particular

9538
06:56:42,400 --> 06:56:44,300
machine where the data is located.

9539
06:56:44,300 --> 06:56:47,223
And to start with, in
standalone mode,

9540
06:56:47,223 --> 06:56:49,500
you can use the local
file system as well.

9541
06:56:49,600 --> 06:56:51,508
So this could be used especially

9542
06:56:51,508 --> 06:56:53,818
when we are doing
development or any

9543
06:56:53,818 --> 06:56:56,390
PoC; you can use
the local file system.

9544
06:56:56,390 --> 06:56:59,500
And Amazon's cloud provides
another storage system called

9545
06:56:59,500 --> 06:57:02,119
S3, the Simple
Storage Service; we call

9546
06:57:02,119 --> 06:57:03,100
it S3.

9547
06:57:03,100 --> 06:57:04,998
It's an object storage service.

9548
06:57:04,998 --> 06:57:06,700
This can also be leveraged

9549
06:57:06,700 --> 06:57:09,238
or used within Spark
for storage,

9550
06:57:09,800 --> 06:57:11,100
and Spark supports a lot of other

9551
06:57:11,100 --> 06:57:14,700
file systems as well. There are
some file systems, like Alluxio,

9552
06:57:14,700 --> 06:57:17,700
which provide
in-memory storage,

9553
06:57:17,700 --> 06:57:20,800
so we can leverage that
particular file system as well.

9554
06:57:21,100 --> 06:57:22,796
So we have seen
all the features,

9555
06:57:22,796 --> 06:57:25,580
the functionalities
available within Spark.

9556
06:57:25,580 --> 06:57:27,600
Now we're going to look
at the limitations

9557
06:57:27,600 --> 06:57:28,800
of using Spark.

9558
06:57:28,800 --> 06:57:30,252
Of course, every component

9559
06:57:30,252 --> 06:57:33,000
that comes with
huge power and advantages

9560
06:57:33,000 --> 06:57:35,200
will have its own
limitations as well.

9561
06:57:35,300 --> 06:57:38,900
The question: illustrate
some limitations of using

9562
06:57:38,900 --> 06:57:41,900
Spark. Spark utilizes
more storage space

9563
06:57:41,900 --> 06:57:43,400
compared to Hadoop

9564
06:57:43,400 --> 06:57:44,715
when it comes
to the installation.

9565
06:57:44,715 --> 06:57:47,600
It's going to consume more space,
but in the big data world

9566
06:57:47,600 --> 06:57:49,500
that's not a
very big constraint,

9567
06:57:49,500 --> 06:57:52,206
because storage cost is
not very high

9568
06:57:52,206 --> 06:57:55,504
in the big data space. And
developers need to be careful

9569
06:57:55,504 --> 06:57:58,275
while running their apps
in Spark. The reason is

9570
06:57:58,275 --> 06:58:00,300
that it uses
in-memory computing.

9571
06:58:00,400 --> 06:58:02,870
Of course, it handles
the memory very well.

9572
06:58:02,870 --> 06:58:05,400
But if you try to load
a huge amount of data

9573
06:58:05,400 --> 06:58:08,700
in a distributed environment
and you try to do a join,

9574
06:58:08,700 --> 06:58:09,903
when you try to do a join

9575
06:58:09,903 --> 06:58:13,491
within the distributed world, the
data is going to get transferred

9576
06:58:13,491 --> 06:58:14,700
over the network. The network

9577
06:58:14,700 --> 06:58:18,100
is really a costly
resource, so the plan

9578
06:58:18,200 --> 06:58:20,800
or design should be such
as to reduce or minimize

9579
06:58:20,800 --> 06:58:23,500
the data transferred
over the network,

9580
06:58:23,500 --> 06:58:27,103
and in whatever way
possible, by all possible means,

9581
06:58:27,103 --> 06:58:30,000
we should facilitate
distribution of data

9582
06:58:30,000 --> 06:58:32,200
over multiple machines. The more

9583
06:58:32,200 --> 06:58:34,600
we distribute, the more
parallelism we can achieve

9584
06:58:34,600 --> 06:58:38,500
and the more results we can get.
Next, cost efficiency.

9585
06:58:38,500 --> 06:58:40,700
If you try to compare the cost,

9586
06:58:40,700 --> 06:58:42,800
how much cost is involved

9587
06:58:42,800 --> 06:58:45,700
to do a particular
processing, take any unit

9588
06:58:45,700 --> 06:58:48,545
in terms of processing,
say 1 GB of data with,

9589
06:58:48,545 --> 06:58:50,200
say, iterative processing.

9590
06:58:50,200 --> 06:58:53,800
Cost-wise, in-memory
computing is always costlier,

9591
06:58:53,800 --> 06:58:57,088
because memory is
relatively costlier

9592
06:58:57,088 --> 06:58:58,200
than storage,

9593
06:58:58,400 --> 06:59:00,000
so that may act
like a bottleneck,

9594
06:59:00,000 --> 06:59:01,400
and we cannot increase

9595
06:59:01,400 --> 06:59:05,200
the memory capacity of
the machine beyond some limit.

9596
06:59:05,900 --> 06:59:07,500
So we have to grow horizontally.

9597
06:59:07,800 --> 06:59:10,042
So when we have
the data distributed

9598
06:59:10,042 --> 06:59:11,900
in memory across the cluster,

9599
06:59:12,000 --> 06:59:13,337
of course the network transfer

9600
06:59:13,337 --> 06:59:15,300
all those bottlenecks
will come into picture.

9601
06:59:15,300 --> 06:59:17,400
So we have to strike
the right balance

9602
06:59:17,400 --> 06:59:20,700
which will help us to achieve
the in-memory computing.

9603
06:59:20,700 --> 06:59:22,775
Whatever the in-memory
computing requirement, it

9604
06:59:22,775 --> 06:59:24,000
will help us to achieve

9605
06:59:24,000 --> 06:59:25,757
and it consumes huge amount

9606
06:59:25,757 --> 06:59:28,400
of data processing
compared to Hadoop

9607
06:59:28,600 --> 06:59:30,600
and Spark performs

9608
06:59:30,600 --> 06:59:33,800
better when it is used for
iterative computing,

9609
06:59:33,800 --> 06:59:36,700
because both Spark
and the other technologies

9610
06:59:36,700 --> 06:59:37,699
have to read the data

9611
06:59:37,699 --> 06:59:39,700
for the first time
from the hard disk or

9612
06:59:39,700 --> 06:59:43,300
from another data source, and Spark's
performance is really better

9613
06:59:43,300 --> 06:59:46,114
when it reads the data
and does the processing

9614
06:59:46,114 --> 06:59:48,500
while the data is available
in the cache.

9615
06:59:48,723 --> 06:59:50,800
Of course, there is the DAG cycle;

9616
06:59:50,800 --> 06:59:53,094
It's going to give
us a lot of advantage

9617
06:59:53,094 --> 06:59:54,400
while doing the processing

9618
06:59:54,400 --> 06:59:56,802
but it is the in-memory
computing

9619
06:59:56,802 --> 06:59:59,400
that's going to give
us lots of leverage.

9620
06:59:59,400 --> 07:00:01,605
The next question:
list some use cases

9621
07:00:01,605 --> 07:00:04,300
where Spark outperforms
Hadoop in processing.

9622
07:00:04,400 --> 07:00:06,300
The first thing is
real-time processing.

9623
07:00:06,300 --> 07:00:08,629
Hadoop cannot handle
real-time processing,

9624
07:00:08,629 --> 07:00:10,884
but Spark can handle
real-time processing.

9625
07:00:10,884 --> 07:00:13,843
So take any data that's coming in.
In the Lambda architecture

9626
07:00:13,843 --> 07:00:15,300
you will have three layers.

9627
07:00:15,300 --> 07:00:17,210
Most of the big
data projects will be

9628
07:00:17,210 --> 07:00:18,500
in the Lambda architecture.

9629
07:00:18,500 --> 07:00:21,500
You will have the speed layer,
batch layer and serving layer,

9630
07:00:21,500 --> 07:00:23,900
and in the speed layer,
whenever the data comes

9631
07:00:23,900 --> 07:00:26,900
in, it needs to be processed,
stored and handled.

9632
07:00:26,900 --> 07:00:27,975
So for those types

9633
07:00:27,975 --> 07:00:30,800
of real-time processing, Spark
is the best fit.

9634
07:00:30,800 --> 07:00:32,500
Of course, in the
Hadoop ecosystem

9635
07:00:32,500 --> 07:00:33,837
we have other components

9636
07:00:33,837 --> 07:00:36,400
which do real-time
processing, like Storm.

9637
07:00:36,400 --> 07:00:39,000
But when you want to leverage
machine learning

9638
07:00:39,000 --> 07:00:40,500
along with Spark Streaming,

9639
07:00:40,500 --> 07:00:43,200
for such computation Spark
will be much better.

9640
07:00:43,200 --> 07:00:44,243
So that's why,

9641
07:00:44,243 --> 07:00:45,621
when you have an architecture

9642
07:00:45,621 --> 07:00:47,900
like the Lambda architecture
and you want to have

9643
07:00:47,900 --> 07:00:51,100
all three layers, batch layer,
speed layer and serving layer,

9644
07:00:51,100 --> 07:00:54,800
Spark can handle the speed layer
and serving layer far better,

9645
07:00:54,800 --> 07:00:56,800
and it's going to provide
better performance.

9646
07:00:56,800 --> 07:00:59,400
And whenever you do
batch processing,

9647
07:00:59,400 --> 07:01:02,400
especially like doing
machine learning processing,

9648
07:01:02,400 --> 07:01:04,501
we will leverage
iterative computing,

9649
07:01:04,501 --> 07:01:06,210
and Spark can perform up to a hundred times

9650
07:01:06,210 --> 07:01:08,800
faster than Hadoop.
The more iterative processing

9651
07:01:08,800 --> 07:01:11,600
that we do, the more data
will be read from memory,

9652
07:01:11,600 --> 07:01:14,700
and it's going to get us
much faster performance

9653
07:01:14,700 --> 07:01:16,700
than Hadoop MapReduce.

9654
07:01:16,700 --> 07:01:20,100
So again, remember: whenever you
do the processing only once,

9655
07:01:20,100 --> 07:01:23,000
so you're going to do
the processing only once,

9656
07:01:23,000 --> 07:01:24,900
read, process it and deliver

9657
07:01:24,900 --> 07:01:27,516
the result, Spark
may not be the best fit;

9658
07:01:27,516 --> 07:01:30,200
that can be done
with MapReduce itself.

9659
07:01:30,200 --> 07:01:32,773
And there is another component
called Akka. It's

9660
07:01:32,773 --> 07:01:35,600
a messaging system,
or message coordinating

9661
07:01:35,600 --> 07:01:38,500
system. Spark
internally uses Akka

9662
07:01:38,500 --> 07:01:40,500
for scheduling any task

9663
07:01:40,500 --> 07:01:43,100
that needs to be assigned
by the master to the worker,

9664
07:01:43,700 --> 07:01:45,700
and for the follow-up
of that particular task

9665
07:01:45,700 --> 07:01:49,000
by the master. It is basically
an asynchronous coordination system,

9666
07:01:49,000 --> 07:01:51,000
and that's achieved using Akka.

9667
07:01:51,400 --> 07:01:55,100
Akka programming is
used internally by Spark;

9668
07:01:55,100 --> 07:01:56,551
as such, as developers

9669
07:01:56,551 --> 07:01:59,358
we don't need to worry
about Akka programming.

9670
07:01:59,358 --> 07:02:00,900
Of course we can leverage it,

9671
07:02:00,900 --> 07:02:04,500
but Akka is used internally
by Spark for scheduling

9672
07:02:04,500 --> 07:02:08,800
and coordination between the master
and the worker. And within Spark

9673
07:02:08,800 --> 07:02:10,700
we have a few major components.

9674
07:02:10,700 --> 07:02:13,200
Let's see, what are
the major components

9675
07:02:13,200 --> 07:02:14,500
of Apache Spark.

9676
07:02:14,500 --> 07:02:18,069
Lay out the components
of the Spark ecosystem. Spark comes

9677
07:02:18,069 --> 07:02:19,319
with a core engine,

9678
07:02:19,319 --> 07:02:20,700
which has the core

9679
07:02:20,700 --> 07:02:23,570
functionalities required
by Spark.

9680
07:02:23,570 --> 07:02:26,600
Above all, the Spark RDDs
are the building blocks

9681
07:02:26,600 --> 07:02:29,361
of the Spark core engine.
In Spark Core,

9682
07:02:29,361 --> 07:02:31,300
the basic functionalities, such as

9683
07:02:31,300 --> 07:02:34,600
file interaction and file system
coordination, are all done

9684
07:02:34,600 --> 07:02:36,400
by the Spark core engine.

9685
07:02:36,400 --> 07:02:38,432
On top of the Spark core engine,

9686
07:02:38,432 --> 07:02:40,900
we have a number
of other offerings

9687
07:02:40,900 --> 07:02:44,700
to do machine learning, to do
graph computing, to do streaming.

9688
07:02:44,700 --> 07:02:47,000
We have n number
of other components.

9689
07:02:47,000 --> 07:02:48,800
So the most-used components

9690
07:02:48,800 --> 07:02:51,000
are components
like Spark SQL,

9691
07:02:51,000 --> 07:02:52,037
Spark Streaming,

9692
07:02:52,037 --> 07:02:55,520
MLlib, GraphX
and SparkR, at a high level.

9693
07:02:55,520 --> 07:02:58,400
We will see what
these components are. Spark

9694
07:02:58,400 --> 07:03:02,000
SQL is especially designed
to do processing

9695
07:03:02,000 --> 07:03:03,729
against structured data,

9696
07:03:03,729 --> 07:03:07,400
so we can write SQL queries
and we can handle

9697
07:03:07,400 --> 07:03:08,854
or do the processing.

9698
07:03:08,854 --> 07:03:11,400
So it's going to give us
the interface to interact

9699
07:03:11,400 --> 07:03:12,100
with the data,

9700
07:03:12,300 --> 07:03:15,900
especially structured data,
and the language

9701
07:03:15,900 --> 07:03:18,700
that we can use
is very similar to

9702
07:03:18,700 --> 07:03:20,600
what we use within SQL.

9703
07:03:20,600 --> 07:03:22,700
I can say
99 percent is the same,

9704
07:03:22,700 --> 07:03:25,934
and most of the commonly used
functionalities within SQL

9705
07:03:25,934 --> 07:03:28,111
have been implemented
within Spark SQL.

9706
07:03:28,111 --> 07:03:31,700
And Spark Streaming is going to
support stream processing.

9707
07:03:31,700 --> 07:03:34,000
That's the offering
available to handle

9708
07:03:34,000 --> 07:03:35,920
stream processing. And MLlib

9709
07:03:35,920 --> 07:03:38,900
is the offering
to handle machine learning.

9710
07:03:38,900 --> 07:03:42,700
So the component name
is MLlib, and it has a list

9711
07:03:42,700 --> 07:03:44,300
of components, a list

9712
07:03:44,300 --> 07:03:47,300
of machine learning
algorithms already defined;

9713
07:03:47,300 --> 07:03:50,700
we can leverage and use any
of those machine learning algorithms.

9714
07:03:51,400 --> 07:03:54,944
GraphX, again, is
a graph processing offering

9715
07:03:54,944 --> 07:03:56,200
within Spark.

9716
07:03:56,200 --> 07:03:59,141
It's going to help us
achieve graph computing

9717
07:03:59,141 --> 07:04:02,330
against the data that we have,
like PageRank calculation,

9718
07:04:02,330 --> 07:04:04,107
how many connected entities,

9719
07:04:04,107 --> 07:04:07,600
how many triangles; all of those
are going to give meaning

9720
07:04:07,600 --> 07:04:09,300
to that particular data.

9721
07:04:09,300 --> 07:04:12,500
And SparkR is the component
that is going to interact with

9722
07:04:12,500 --> 07:04:14,371
or help us to leverage

9723
07:04:14,371 --> 07:04:17,856
the language R
within the Spark environment.

9724
07:04:18,100 --> 07:04:20,600
R is a statistical
programming language

9725
07:04:20,600 --> 07:04:23,170
with which we can do
statistical computing

9726
07:04:23,170 --> 07:04:24,700
within the Spark environment,

9727
07:04:24,700 --> 07:04:28,306
and we can leverage the R language
by using SparkR to get

9728
07:04:28,306 --> 07:04:32,194
that executed within the Spark
environment. In addition to that,

9729
07:04:32,194 --> 07:04:35,675
there are other components
as well, like an approximate database

9730
07:04:35,675 --> 07:04:39,118
called BlinkDB. All the other
things are in the beta stage.

9731
07:04:39,118 --> 07:04:42,541
So these are the most-used
components within Spark.

9732
07:04:42,541 --> 07:04:43,561
So next question.

9733
07:04:43,561 --> 07:04:45,944
How can Spark be used
alongside Hadoop?

9734
07:04:45,944 --> 07:04:49,000
So even though Spark
performs much better, it's

9735
07:04:49,000 --> 07:04:51,000
not a replacement for Hadoop.

9736
07:04:51,000 --> 07:04:52,100
It's going to coexist

9737
07:04:52,100 --> 07:04:55,488
with Hadoop. Leveraging
Spark

9738
07:04:55,488 --> 07:04:56,900
and Hadoop together

9739
07:04:56,900 --> 07:05:00,000
is going to help us
to achieve the best result.

9740
07:05:00,000 --> 07:05:00,268
Yes.

9741
07:05:00,268 --> 07:05:04,300
Spark can do in-memory computing
or handle the speed layer,

9742
07:05:04,300 --> 07:05:06,600
and Hadoop comes
with a resource manager,

9743
07:05:06,600 --> 07:05:08,500
so we can leverage
the resource manager

9744
07:05:08,500 --> 07:05:10,900
of Hadoop to make Spark work.

9745
07:05:11,000 --> 07:05:13,529
And for some processing we
don't need to leverage

9746
07:05:13,529 --> 07:05:14,904
in-memory computing.

9747
07:05:14,904 --> 07:05:18,500
For example, one-time processing:
do the processing and forget.

9748
07:05:18,500 --> 07:05:20,773
We just store the result; there we
can use MapReduce,

9749
07:05:20,773 --> 07:05:24,700
so the processing cost, the
computing cost, will be much less

9750
07:05:24,700 --> 07:05:26,100
compared to Spark.

9751
07:05:26,100 --> 07:05:29,400
So we can amalgamate them and
strike the right balance

9752
07:05:29,400 --> 07:05:31,700
between batch processing
and stream processing

9753
07:05:31,700 --> 07:05:34,507
when we have Spark
along with Hadoop.

9754
07:05:34,507 --> 07:05:38,100
Let's have some detailed questions
related to Spark Core.

9755
07:05:38,100 --> 07:05:39,100
Within Spark Core,

9756
07:05:39,100 --> 07:05:41,900
as I mentioned earlier,
the core building block

9757
07:05:41,900 --> 07:05:45,600
of Spark Core is the RDD, the Resilient
Distributed Dataset.

9758
07:05:45,600 --> 07:05:46,654
It's a virtual.

9759
07:05:46,654 --> 07:05:48,442
It's not a physical entity.

9760
07:05:48,442 --> 07:05:49,900
It's a logical entity.

9761
07:05:49,900 --> 07:05:52,400
You will not see
this RDD existing.

9762
07:05:52,400 --> 07:05:54,700
The existence of the RDD
will come into the picture

9763
07:05:54,900 --> 07:05:56,474
when you take some action.

9764
07:05:56,474 --> 07:05:59,200
So this RDD
will be used or referred to

9765
07:05:59,200 --> 07:06:00,800
to create the DAG cycle,

9766
07:06:00,943 --> 07:06:05,500
and RDDs will be optimized
to transform from one form

9767
07:06:05,500 --> 07:06:07,264
to another form, to make a plan for

9768
07:06:07,264 --> 07:06:09,400
how the data set needs
to be transformed

9769
07:06:09,400 --> 07:06:11,500
from one structure
to another structure.

9770
07:06:11,700 --> 07:06:14,817
And finally, when you take some action
against an RDD, the existence

9771
07:06:14,817 --> 07:06:15,924
of the data structure,

9772
07:06:15,924 --> 07:06:18,200
the resultant data,
will come into the picture,

9773
07:06:18,200 --> 07:06:20,500
and that can be stored
in any file system,

9774
07:06:20,500 --> 07:06:22,000
whether it's HDFS, S3

9775
07:06:22,000 --> 07:06:24,568
or any other file system.
And

9776
07:06:24,568 --> 07:06:27,900
the RDD can exist
in a partitioned form, in the sense that

9777
07:06:27,900 --> 07:06:30,600
it can get distributed
across multiple systems,

9778
07:06:30,600 --> 07:06:33,800
and it's fault tolerant.

9779
07:06:33,800 --> 07:06:36,494
If any part of the RDD
is lost, if any partition

9780
07:06:36,494 --> 07:06:37,742
of the RDD is lost,

9781
07:06:37,742 --> 07:06:40,700
it can regenerate only
that specific partition.

9782
07:06:40,700 --> 07:06:41,700
It can regenerate it,

9783
07:06:41,900 --> 07:06:43,900
so that's a huge
advantage of the RDD.

9784
07:06:43,900 --> 07:06:46,600
So, to recap, the first
huge advantage of the RDD:

9785
07:06:46,600 --> 07:06:47,900
it's fault-tolerant,

9786
07:06:47,900 --> 07:06:50,600
in that it can regenerate
the lost RDDs.

9787
07:06:50,600 --> 07:06:53,606
And it can exist
in a distributed fashion,

9788
07:06:53,606 --> 07:06:55,165
and it is immutable, in the

9789
07:06:55,165 --> 07:06:59,300
sense that once the RDD is defined,
it cannot be changed.
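As a rough illustration of the fault-tolerance point above, here is a minimal plain-Python sketch, not Spark's API. The names build_partitions and recover_partition are hypothetical, and the sketch assumes the source data and the recorded transformation (the lineage) are still available. It shows that only the lost partition has to be recomputed.

```python
def build_partitions(source_partitions, transform):
    """Apply one recorded transformation to every source partition."""
    return {pid: [transform(x) for x in part]
            for pid, part in source_partitions.items()}


def recover_partition(lost_id, source_partitions, transform):
    """Recompute a single lost partition from its source data and lineage."""
    return [transform(x) for x in source_partitions[lost_id]]


def square(x):
    return x * x


# Hypothetical source data, already split into three partitions.
source = {0: [1, 2], 1: [3, 4], 2: [5, 6]}

derived = build_partitions(source, square)
del derived[1]  # simulate losing exactly one partition

# Only partition 1 is rebuilt; partitions 0 and 2 are left untouched.
derived[1] = recover_partition(1, source, square)
print(derived[1])  # [9, 16]
```

The derived data is never edited in place; a lost piece is simply recomputed, which is why immutability and fault tolerance go together here.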

9790
07:06:59,300 --> 07:07:01,500
The next question is
how do we create rdds

9791
07:07:01,500 --> 07:07:04,500
in Spark? There are two ways we
can create RDDs. One:

9792
07:07:04,664 --> 07:07:09,700
using the Spark context, we
can take any of the collections

9793
07:07:09,700 --> 07:07:12,700
available within
Scala or Java and, using

9794
07:07:12,700 --> 07:07:14,000
the parallelize function,

9795
07:07:14,000 --> 07:07:17,049
create the RDD.
And it's going to use

9796
07:07:17,049 --> 07:07:20,474
the underlying file
system's distribution mechanism:

9797
07:07:20,474 --> 07:07:23,900
if the data is located
in a distributed file system

9798
07:07:23,900 --> 07:07:24,700
like HDFS,

9799
07:07:25,000 --> 07:07:27,154
it will leverage
that and it will make

9800
07:07:27,154 --> 07:07:30,331
those RDDs available
on a number of systems.

9801
07:07:30,331 --> 07:07:33,696
So it's going to leverage
and follow the same distribution

9802
07:07:33,696 --> 07:07:34,700
in the RDD as well.

9803
07:07:34,700 --> 07:07:37,200
Or we can create the RDD
by loading the data

9804
07:07:37,200 --> 07:07:39,835
from external sources
as well, like HBase

9805
07:07:39,835 --> 07:07:42,900
and HDFS. HDFS we may not consider
an external source;

9806
07:07:42,900 --> 07:07:45,300
it will be considered
the file system of Hadoop.

9807
07:07:45,400 --> 07:07:47,300
So when Spark is working

9808
07:07:47,300 --> 07:07:49,743
with Hadoop, mostly
the file system

9809
07:07:49,743 --> 07:07:51,900
we will be using will be HDFS.

9810
07:07:51,900 --> 07:07:53,782
We can read
from HBase,

9811
07:07:53,782 --> 07:07:55,900
or we can even read
from other sources,

9812
07:07:55,900 --> 07:07:59,781
like Parquet files, or S3,
or different formats like Avro.

9813
07:07:59,781 --> 07:08:02,000
You can read them and create the RDD.
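The two routes just described can be sketched in plain Python without a Spark installation. This is an assumption-laden toy: the function names parallelize and from_text are hypothetical stand-ins that only mimic the idea of slicing an in-memory collection, or records loaded from an external source, into partitions. They are not the Spark calls themselves.

```python
def parallelize(records, num_slices):
    """Route 1: cut an in-memory collection into contiguous slices."""
    records = list(records)
    n = len(records)
    return [records[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def from_text(text):
    """Route 2: build records from an external source.

    A string stands in for a file here so the sketch stays self-contained.
    """
    return text.splitlines()


# An existing collection, split into three partitions.
slices = parallelize(range(10), 3)
print(slices)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# Records loaded from "outside", then partitioned the same way.
lines = parallelize(from_text("a\nb\nc\nd"), 2)
print(lines)  # [['a', 'b'], ['c', 'd']]
```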

9814
07:08:02,200 --> 07:08:03,000
Next question is

9815
07:08:03,000 --> 07:08:05,800
what is executor memory
in a Spark application?

9816
07:08:05,800 --> 07:08:08,100
Every Spark application
will have a fixed

9817
07:08:08,100 --> 07:08:09,900
heap size and a fixed number

9818
07:08:09,900 --> 07:08:13,196
of cores for the Spark
executor. An executor is nothing

9819
07:08:13,196 --> 07:08:16,500
but the execution unit
available in every machine,

9820
07:08:16,500 --> 07:08:19,600
and it's going to facilitate
the processing, to do

9821
07:08:19,600 --> 07:08:21,654
the tasks on the worker machine.

9822
07:08:21,654 --> 07:08:25,300
So irrespective of whether you
use the YARN resource manager

9823
07:08:25,300 --> 07:08:26,800
or any other, like the Mesos

9824
07:08:26,800 --> 07:08:29,600
resource manager,
on every worker machine

9825
07:08:29,600 --> 07:08:31,200
we will have an executor,

9826
07:08:31,200 --> 07:08:34,400
and within the executor
the task will be handled

9827
07:08:34,400 --> 07:08:38,700
and the memory to be allocated
for that particular executor is

9828
07:08:38,700 --> 07:08:41,893
what we define as the heap size,
and we can define

9829
07:08:41,893 --> 07:08:42,775
how much

9830
07:08:42,775 --> 07:08:45,788
memory should be used
for that particular executor

9831
07:08:45,788 --> 07:08:47,700
within the worker
machine, as well

9832
07:08:47,700 --> 07:08:50,900
as the number of cores that
can be used

9833
07:08:51,000 --> 07:08:53,988
by the executor
within the Spark application,

9834
07:08:53,988 --> 07:08:55,600
and that can be controlled

9835
07:08:55,600 --> 07:08:58,100
through the configuration
files of spark.

9836
07:08:58,100 --> 07:09:01,300
Next question: define
partitions in Apache Spark.

9837
07:09:01,300 --> 07:09:03,100
So any data, irrespective of

9838
07:09:03,100 --> 07:09:05,478
whether it is small
data or large data,

9839
07:09:05,478 --> 07:09:07,213
we can divide those data sets

9840
07:09:07,213 --> 07:09:10,708
across multiple systems.
The process of dividing the data

9841
07:09:10,708 --> 07:09:11,961
into multiple pieces

9842
07:09:11,961 --> 07:09:13,310
and storing it

9843
07:09:13,310 --> 07:09:16,500
across multiple systems as
different logical units

9844
07:09:16,500 --> 07:09:17,549
is called partitioning.

9845
07:09:17,549 --> 07:09:20,600
So in simple terms, partitioning
is nothing but the process

9846
07:09:20,600 --> 07:09:21,700
of dividing the data

9847
07:09:21,700 --> 07:09:24,800
and storing it in multiple systems;
the pieces are called partitions.

9848
07:09:24,800 --> 07:09:26,600
And by default the conversion

9849
07:09:26,600 --> 07:09:29,700
of the data into an RDD
will happen on the system

9850
07:09:29,700 --> 07:09:31,400
where the partition exists.

9851
07:09:31,400 --> 07:09:33,830
So the more partitions,
the more parallelism

9852
07:09:33,830 --> 07:09:36,049
we are going to get.
At the same time,

9853
07:09:36,049 --> 07:09:38,500
we have to be careful
not to trigger a huge amount

9854
07:09:38,500 --> 07:09:40,100
of network data transfer as well.

9855
07:09:40,300 --> 07:09:43,455
Every RDD can
be partitioned within Spark,

9856
07:09:43,455 --> 07:09:45,700
and the partitioning
is

9857
07:09:45,700 --> 07:09:49,559
going to help us to achieve
parallelism. The more partitions

9858
07:09:49,559 --> 07:09:50,685
that we have, the more

9859
07:09:50,685 --> 07:09:52,000
parallel executions can be done.

9860
07:09:52,000 --> 07:09:54,300
And the key thing
about the success

9861
07:09:54,300 --> 07:09:58,200
of a Spark program is
minimizing the network traffic

9862
07:09:58,200 --> 07:10:00,900
while doing the parallel
processing, and minimizing

9863
07:10:00,900 --> 07:10:04,247
the data transfer
between the systems of Spark.
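To make the trade-off between parallelism and network transfer a little more concrete, here is a small plain-Python sketch of hash partitioning by key. It is an illustration only, not Spark's partitioner. The names partition_for and hash_partition are hypothetical, and CRC32 is used simply as a hash that is stable across runs.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from a stable hash of the key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(pairs, num_partitions):
    """Place every (key, value) pair in the partition chosen by its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_for(key, num_partitions)].append((key, value))
    return parts


pairs = [("spark", 1), ("hadoop", 1), ("spark", 1), ("yarn", 1), ("hadoop", 1)]
parts = hash_partition(pairs, 3)

# Every occurrence of a key lands in the same partition, so a per-key
# aggregation can then run locally without moving more data around.
for index, part in enumerate(parts):
    print(index, part)
```

The design point is that the data moves once, when it is placed by key; later per-key work can stay on the machine that holds the partition.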

9864
07:10:04,247 --> 07:10:08,000
What operations does an RDD
support? I can perform

9865
07:10:08,000 --> 07:10:10,228
multiple operations
against an RDD.

9866
07:10:10,228 --> 07:10:13,900
There are two types of things
we can do; we can group them

9867
07:10:13,900 --> 07:10:16,000
into two. One is transformations:

9868
07:10:16,000 --> 07:10:18,800
in transformations, the RDD
will get transformed

9869
07:10:18,800 --> 07:10:20,600
from one form to another form.

9870
07:10:20,600 --> 07:10:22,600
Select, filtering, grouping: with all

9871
07:10:22,600 --> 07:10:25,000
of that, it's going
to get transformed

9872
07:10:25,000 --> 07:10:28,000
from one form to another form.
Small examples

9873
07:10:28,000 --> 07:10:31,470
like reduceByKey and filter
will all be transformations.

9874
07:10:31,470 --> 07:10:33,700
The resultant of
the transformation will be

9875
07:10:33,700 --> 07:10:35,300
another RDD. At the same time,

9876
07:10:35,300 --> 07:10:37,700
we can take some actions
against the RDD

9877
07:10:37,700 --> 07:10:40,245
that are going to give
us the final result.

9878
07:10:40,245 --> 07:10:41,262
I can say, count

9879
07:10:41,262 --> 07:10:43,500
how many records
there are, or store

9880
07:10:43,500 --> 07:10:45,700
that result into HDFS.

9881
07:10:46,100 --> 07:10:49,541
They are all actions, so
multiple actions can be taken

9882
07:10:49,541 --> 07:10:50,600
against the RDD.

9883
07:10:50,600 --> 07:10:53,700
The existence of the data
will come into the picture only

9884
07:10:53,700 --> 07:10:56,200
if I take some action
against that RDD.

9885
07:10:56,200 --> 07:10:56,515
Okay.

9886
07:10:56,515 --> 07:10:57,400
Next question.

9887
07:10:57,400 --> 07:11:01,000
What do you understand
by transformations in spark?

9888
07:11:01,100 --> 07:11:03,679
So Transformations are
nothing but functions

9889
07:11:03,679 --> 07:11:06,800
mostly they will be higher-order
functions within Scala,

9890
07:11:06,800 --> 07:11:09,400
and we have these
higher-order functions,

9891
07:11:09,400 --> 07:11:12,356
which will be applied
against the RDD,

9892
07:11:12,356 --> 07:11:14,100
mostly against the list

9893
07:11:14,100 --> 07:11:16,407
of elements that we
have within the RDD;

9894
07:11:16,407 --> 07:11:19,314
that function will get
applied. But the existence

9895
07:11:19,314 --> 07:11:21,875
of the RDD will come
into the picture only

9896
07:11:21,875 --> 07:11:25,597
if we take some action against
it. In this particular example,

9897
07:11:25,597 --> 07:11:26,900
I am reading the file

9898
07:11:26,900 --> 07:11:30,536
and holding it in the RDD
called raw data. Then I am doing

9899
07:11:30,536 --> 07:11:32,500
some transformation using a map.

9900
07:11:32,500 --> 07:11:34,382
So it's going
to apply a function;

9901
07:11:34,382 --> 07:11:35,623
within the map I have

9902
07:11:35,623 --> 07:11:39,100
some function which will split
each record using the tab.

9903
07:11:39,100 --> 07:11:41,632
So the split by tab
will be applied

9904
07:11:41,632 --> 07:11:44,300
against each record
within the raw data

9905
07:11:44,300 --> 07:11:48,200
and the resultant movies data
will again be another rdd,

9906
07:11:48,200 --> 07:11:50,644
but of course,
this will be a lazy operation.

9907
07:11:50,644 --> 07:11:53,700
The existence of movies data
will come into picture only

9908
07:11:53,700 --> 07:11:57,700
if I take some action
against it like count or print

9909
07:11:57,726 --> 07:12:01,573
or store. Only those actions
will generate the data.
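The lazy behaviour described above can be imitated in a few lines of plain Python. This is only a toy model under stated assumptions: the class LazyDataset and the helper load_lines are hypothetical, the sample records are made up, and the names raw_data and movies_data simply echo the variables the speaker mentions. It is not how Spark is implemented.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that returns the input records
        self._steps = tuple(steps)   # recorded transformations, in order

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        records = iter(self._source())
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    def collect(self):
        # Action: actually reads the source and applies every step.
        return list(self._run())

    def count(self):
        # Action: same, but only counts the resulting records.
        return sum(1 for _ in self._run())


reads = []  # records each time the "file" is really read


def load_lines():
    reads.append("read")
    return ["Inception\t2010", "Heat\t1995", "Up\t2009"]


raw_data = LazyDataset(load_lines)
movies_data = raw_data.map(lambda line: line.split("\t"))

assert reads == []          # nothing has been read so far
print(movies_data.count())  # 3; the action triggers the read and the split
print(reads)                # ['read']
```

Until count is called, movies_data is only a plan; the file read and the tab split happen when an action asks for a result.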

9910
07:12:01,800 --> 07:12:04,600
So next question:
define the functions of Spark Core.

9911
07:12:04,600 --> 07:12:07,100
It's going to take care
of the memory management

9912
07:12:07,100 --> 07:12:09,400
and fault tolerance of RDDs.

9913
07:12:09,400 --> 07:12:12,700
It's going to help us
to schedule and distribute the tasks

9914
07:12:12,700 --> 07:12:15,400
and manage the jobs running
within the cluster,

9915
07:12:15,400 --> 07:12:17,700
and it's going to help
us to store the data

9916
07:12:17,700 --> 07:12:20,700
in the storage system as well
as read data from the storage

9917
07:12:20,700 --> 07:12:23,905
system, that is, to do the file
system-level operations.

9918
07:12:23,905 --> 07:12:25,200
It's going to help us with that.

9919
07:12:25,200 --> 07:12:27,500
And Spark Core programming
can be done in any

9920
07:12:27,500 --> 07:12:30,347
of these languages,
like Java, Scala, Python,

9921
07:12:30,347 --> 07:12:32,500
as well as using R. So Core is

9922
07:12:32,500 --> 07:12:35,600
at the horizontal level;
on top of Spark Core we can have

9923
07:12:35,600 --> 07:12:37,500
a number of components.

9924
07:12:37,600 --> 07:12:41,000
And there are different types
of RDDs available; one such

9925
07:12:41,000 --> 07:12:42,923
special type is the pair RDD.

9926
07:12:42,923 --> 07:12:43,800
So next question.

9927
07:12:43,800 --> 07:12:46,100
What do you understand
by pair RDD?

9928
07:12:46,100 --> 07:12:49,792
It's going to exist
in pairs, as keys and values,

9929
07:12:49,800 --> 07:12:51,906
so I can use some special functions

9930
07:12:51,906 --> 07:12:55,400
within the pair RDDs,
or special transformations,

9931
07:12:55,400 --> 07:12:58,900
like collecting all the values
corresponding to the same key

9932
07:12:58,900 --> 07:13:00,200
like sort and shuffle,

9933
07:13:00,300 --> 07:13:02,800
what happens within
the sort and shuffle of Hadoop,

9934
07:13:02,900 --> 07:13:04,356
those type of operations

9935
07:13:04,356 --> 07:13:05,161
like you want

9936
07:13:05,161 --> 07:13:08,339
to consolidate or group
all the values corresponding

9937
07:13:08,339 --> 07:13:10,792
to the same key, or
apply some functions

9938
07:13:10,792 --> 07:13:14,400
against all the values
corresponding to the same key.

9939
07:13:14,400 --> 07:13:16,200
Like I want to get the sum

9940
07:13:16,200 --> 07:13:20,400
of the values of all the keys,
we can use the pair RDD

9941
07:13:20,400 --> 07:13:23,600
and get that achieved. So
the data

9942
07:13:23,600 --> 07:13:29,300
within the RDD is going to exist
in pairs: keys and values. Right.
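A minimal sketch of that sum-per-key idea in plain Python (not the Spark API; the sample pairs and the helper sum_by_key are made up for illustration). In PySpark the corresponding pair RDD transformation is reduceByKey.
```python
# Plain-Python sketch of a reduceByKey-style sum (not the Spark API).
# The (key, value) pairs below are made-up sample data.
from collections import defaultdict
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]
def sum_by_key(records):
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value  # combine all values that share a key
    return dict(totals)
sums = sum_by_key(pairs)
```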

9943
07:13:29,300 --> 07:13:31,376
Okay a question from Jason.

9944
07:13:31,376 --> 07:13:33,223
What are vector RDDs?

9945
07:13:33,300 --> 07:13:36,300
In machine learning you
will have a huge amount

9946
07:13:36,300 --> 07:13:38,700
of processing handled by vectors

9947
07:13:38,700 --> 07:13:42,812
and matrices and we do lots
of operations, vector operations,

9948
07:13:42,812 --> 07:13:44,200
like a feature vector,

9949
07:13:44,200 --> 07:13:47,700
or transforming any data
into a vector form. So vectors,

9950
07:13:47,700 --> 07:13:50,755
in the usual sense,
will have a direction

9951
07:13:50,755 --> 07:13:51,624
and magnitude,

9952
07:13:51,624 --> 07:13:54,900
so we can do some operations,
like summing two vectors,

9953
07:13:54,900 --> 07:13:58,622
and what is the difference
between the vector A

9954
07:13:58,622 --> 07:14:00,500
and B, as well as A and C.

9955
07:14:00,500 --> 07:14:02,400
If the difference
between vector A

9956
07:14:02,400 --> 07:14:04,200
and B is less compared to A

9957
07:14:04,200 --> 07:14:06,487
and C, we can say that vectors A

9958
07:14:06,487 --> 07:14:10,825
and B are somewhat similar
in terms of features.

9959
07:14:11,100 --> 07:14:13,815
So the vector RDD
will be used to represent

9960
07:14:13,815 --> 07:14:17,100
the vector directly and
that will be used extensively

9961
07:14:17,100 --> 07:14:19,500
while doing the
machine learning. And Jason,

9962
07:14:19,700 --> 07:14:20,500
thank you. There

9963
07:14:20,500 --> 07:14:21,400
is another question:

9964
07:14:21,400 --> 07:14:22,900
What is RDD lineage?

9965
07:14:22,900 --> 07:14:25,800
So here, for any data
processing, any transformations

9966
07:14:25,800 --> 07:14:28,811
that we do it maintains
something called a lineage.

9967
07:14:28,811 --> 07:14:31,100
So, how the data
is getting transformed

9968
07:14:31,100 --> 07:14:33,543
when the data is available
in partitioned form

9969
07:14:33,543 --> 07:14:36,300
in multiple systems and
when we do the transformation,

9970
07:14:36,300 --> 07:14:39,800
it will undergo multiple steps
and in the distributed world

9971
07:14:39,800 --> 07:14:42,700
it's very common to have
failures of machines

9972
07:14:42,700 --> 07:14:45,200
or machines going
out of the network

9973
07:14:45,200 --> 07:14:47,000
and the system or framework,

9974
07:14:47,000 --> 07:14:47,800
as such, should be

9975
07:14:47,800 --> 07:14:50,800
in a position to handle that.
Spark handles it through

9976
07:14:50,858 --> 07:14:55,800
RDD lineage. It can restore
the lost partition only. Assume,

9977
07:14:55,800 --> 07:14:59,004
like out of ten machines
data is distributed

9978
07:14:59,004 --> 07:15:00,828
across five machines. Out of

9979
07:15:00,828 --> 07:15:03,800
those five machines,
one machine is lost.

9980
07:15:03,800 --> 07:15:06,500
So whatever the
latest transformation

9981
07:15:06,500 --> 07:15:07,807
that had the data

9982
07:15:08,000 --> 07:15:10,100
for that particular
partition, the partition

9983
07:15:10,100 --> 07:15:13,924
on the lost machine alone
can be regenerated, and it knows

9984
07:15:13,924 --> 07:15:16,700
how to regenerate that data
or how to get that resultant

9985
07:15:16,700 --> 07:15:18,384
data, using the concept

9986
07:15:18,384 --> 07:15:21,153
of RDD lineage:
from which data source

9987
07:15:21,153 --> 07:15:22,200
it got generated,

9988
07:15:22,200 --> 07:15:23,800
what its previous step was.

9989
07:15:23,800 --> 07:15:26,300
So the complete lineage
will be available,

9990
07:15:26,300 --> 07:15:29,724
and it's maintained by
the spark framework internally.

9991
07:15:29,724 --> 07:15:31,700
We call that RDD lineage.

9992
07:15:31,700 --> 07:15:34,682
What is the Spark driver? To put
it simply, for those

9993
07:15:34,682 --> 07:15:37,600
who are from a Hadoop
background, a YARN background,

9994
07:15:37,600 --> 07:15:40,000
we can compare this
to the application master.

9995
07:15:40,100 --> 07:15:43,300
Every application will
have a spark driver

9996
07:15:43,300 --> 07:15:44,900
that will have a SparkContext,

9997
07:15:44,900 --> 07:15:47,550
which is going to moderate
the complete execution

9998
07:15:47,550 --> 07:15:50,200
of the job. It will connect
to the Spark master,

9999
07:15:50,500 --> 07:15:52,300
deliver the RDD graph,

10000
07:15:52,300 --> 07:15:54,900
that is, the lineage,
to the master,

10001
07:15:54,900 --> 07:15:56,810
and coordinate the tasks.

10002
07:15:56,810 --> 07:15:57,817
Whatever tasks

10003
07:15:57,817 --> 07:16:00,700
get executed
in the distributed environment,

10004
07:16:00,700 --> 07:16:01,500
it can do

10005
07:16:01,500 --> 07:16:04,400
the parallel processing,
do the transformations

10006
07:16:04,600 --> 07:16:06,900
and actions against the RDD.

10007
07:16:06,900 --> 07:16:08,551
So it's a single
point of contact

10008
07:16:08,551 --> 07:16:10,100
for that specific application.

10009
07:16:10,100 --> 07:16:12,500
So the Spark driver
is short-lived,

10010
07:16:12,500 --> 07:16:15,300
and the SparkContext
within the Spark driver

10011
07:16:15,300 --> 07:16:18,558
is going to be the coordinator
between the master and the tasks

10012
07:16:18,558 --> 07:16:20,694
that are running.
And the Spark driver

10013
07:16:20,694 --> 07:16:23,100
can get started
in any of the executors

10014
07:16:23,100 --> 07:16:26,800
within Spark. Name the types
of cluster managers in Spark.

10015
07:16:26,800 --> 07:16:28,800
So whenever you have
a group of machines,

10016
07:16:28,800 --> 07:16:30,247
you need a manager to manage

10017
07:16:30,247 --> 07:16:33,415
the resources. There are different types
of resource managers. Already

10018
07:16:33,415 --> 07:16:35,700
we have seen YARN,
Yet Another Resource

10019
07:16:35,700 --> 07:16:39,400
Negotiator, which manages
the resources of Hadoop. On top

10020
07:16:39,400 --> 07:16:43,000
of YARN we can make
Spark work. Sometimes I

10021
07:16:43,000 --> 07:16:46,700
may want to have Spark alone
in my organization

10022
07:16:46,700 --> 07:16:49,594
and not along with the Hadoop
or any other technology.

10023
07:16:49,594 --> 07:16:50,297
Then I can go

10024
07:16:50,297 --> 07:16:53,100
with standalone. Spark
has a built-in cluster manager,

10025
07:16:53,100 --> 07:16:55,547
so Spark alone can get
executed on multiple systems.

10026
07:16:55,547 --> 07:16:57,423
But generally if we
have a cluster we

10027
07:16:57,423 --> 07:16:58,600
will try to leverage

10028
07:16:58,600 --> 07:17:01,600
various other computing
platforms, computing frameworks,

10029
07:17:01,600 --> 07:17:04,601
like graph processing
with Giraph, on top of that.

10030
07:17:04,601 --> 07:17:07,000
When we try to
leverage that, in that case

10031
07:17:07,000 --> 07:17:08,321
we will go with YARN

10032
07:17:08,321 --> 07:17:10,700
or some generalized
resource manager,

10033
07:17:10,700 --> 07:17:12,000
like Mesos. YARN

10034
07:17:12,000 --> 07:17:14,400
is very specific to Hadoop
and it comes along

10035
07:17:14,400 --> 07:17:18,500
with Hadoop. Mesos is the
cluster-level resource manager,

10036
07:17:18,500 --> 07:17:20,600
and if I have multiple clusters

10037
07:17:20,600 --> 07:17:23,700
within the organization,
then you can use Mesos.

10038
07:17:23,800 --> 07:17:25,883
Mesos is also a resource manager.

10039
07:17:25,883 --> 07:17:29,400
It's a separate top-level project
within Apache. Next question:

10040
07:17:29,400 --> 07:17:30,600
What do you understand

10041
07:17:30,600 --> 07:17:34,200
by worker node? In a cluster,
a distributed environment,

10042
07:17:34,200 --> 07:17:36,252
we will have n number
of workers. We call

10043
07:17:36,252 --> 07:17:38,200
each one a worker node
or a slave node,

10044
07:17:38,200 --> 07:17:40,665
which does the actual
processing. It is going to get

10045
07:17:40,665 --> 07:17:43,300
the data, do the processing,
and get us the result,

10046
07:17:43,300 --> 07:17:45,100
and the master node is going to assign

10047
07:17:45,100 --> 07:17:48,000
what has to be done by
each worker node, and it's going

10048
07:17:48,000 --> 07:17:50,551
to read the data available
on the specific worker node.

10049
07:17:50,551 --> 07:17:53,196
Generally, the tasks assigned
to the worker node,

10050
07:17:53,196 --> 07:17:55,900
or rather, the task will be assigned
to the worker node where the data

10051
07:17:55,900 --> 07:17:57,500
is located. In the big data space,

10052
07:17:57,500 --> 07:18:00,100
especially Hadoop, it will
always try to achieve

10053
07:18:00,100 --> 07:18:01,183
the data locality.

10054
07:18:01,183 --> 07:18:04,391
Apart from that,
the resource availability as

10055
07:18:04,391 --> 07:18:05,900
well, the availability

10056
07:18:05,900 --> 07:18:08,900
of the resource in terms
of CPU and memory,

10057
07:18:08,900 --> 07:18:10,000
will be considered.

10058
07:18:10,000 --> 07:18:13,601
You might have some data
replicated on three machines.

10059
07:18:13,601 --> 07:18:16,884
Suppose all three machines are busy
doing work, with no CPU

10060
07:18:16,884 --> 07:18:19,414
or memory available
to start the other task.

10061
07:18:19,414 --> 07:18:20,400
It will not wait

10062
07:18:20,400 --> 07:18:23,300
for those machines to complete
the job and get the resource

10063
07:18:23,300 --> 07:18:25,900
and do the processing. It
will start the processing

10064
07:18:25,900 --> 07:18:27,000
on some other machine

10065
07:18:27,000 --> 07:18:28,200
which is going to be near

10066
07:18:28,200 --> 07:18:31,300
to the machines having
the data and read the data

10067
07:18:31,300 --> 07:18:32,400
over the network.

10068
07:18:32,600 --> 07:18:35,100
So, to answer straight:
worker machines are nothing but the ones

10069
07:18:35,100 --> 07:18:36,600
which do the actual work

10070
07:18:36,600 --> 07:18:37,755
and report

10071
07:18:37,755 --> 07:18:41,315
to the master in terms of what
is the resource utilization

10072
07:18:41,315 --> 07:18:42,627
and the tasks running

10073
07:18:42,627 --> 07:18:46,000
within the worker machines
will be doing the actual work.

10074
07:18:46,000 --> 07:18:49,049
And what is a sparse vector?
Just a few minutes back

10075
07:18:49,049 --> 07:18:50,656
I was answering a question:

10076
07:18:50,656 --> 07:18:52,697
what is a vector? A vector is nothing

10077
07:18:52,697 --> 07:18:55,500
but a representation of the data
in multidimensional form.

10078
07:18:55,500 --> 07:18:57,500
The vector can
be multi-dimensional

10079
07:18:57,500 --> 07:18:58,500
Vector as well.

10080
07:18:58,500 --> 07:19:02,400
As you know, if I am going
to represent a point in space,

10081
07:19:02,400 --> 07:19:04,938
I need three dimensions:
X, Y and Z.

10082
07:19:05,000 --> 07:19:08,076
So the vector will
have three dimensions.

10083
07:19:08,300 --> 07:19:10,934
If I need to represent
a line in space,

10084
07:19:10,934 --> 07:19:14,107
then I need two points
to represent the starting point

10085
07:19:14,107 --> 07:19:17,700
of the line and the endpoint
of the line. Then I need a vector

10086
07:19:17,700 --> 07:19:18,800
which can hold them,

10087
07:19:18,800 --> 07:19:21,049
so it will have two dimensions:
the first dimension

10088
07:19:21,049 --> 07:19:23,121
will have one point, and
the second dimension

10089
07:19:23,121 --> 07:19:24,400
will have another point,

10090
07:19:24,400 --> 07:19:25,429
let us say point B.

10091
07:19:25,429 --> 07:19:29,200
If I have to represent a plane,
then I need another dimension

10092
07:19:29,200 --> 07:19:30,702
to represent two lines.

10093
07:19:30,702 --> 07:19:31,510
So each line

10094
07:19:31,510 --> 07:19:34,203
will be representing
two points. The same way,

10095
07:19:34,203 --> 07:19:37,200
I can represent any data
using a vector form

10096
07:19:37,200 --> 07:19:40,217
You might have
a huge number of feedback

10097
07:19:40,217 --> 07:19:43,500
or ratings of products
across an organization.

10098
07:19:43,500 --> 07:19:46,327
Let's take a simple example:
Amazon. Amazon has

10099
07:19:46,327 --> 07:19:47,632
millions of products.

10100
07:19:47,632 --> 07:19:50,498
Not every user, not even
a single user, would have

10101
07:19:50,498 --> 07:19:53,461
used all the millions of
products within Amazon.

10102
07:19:53,461 --> 07:19:55,341
At most
we would have used

10103
07:19:55,341 --> 07:19:58,400
like 0.1 percent,
or even less than that,

10104
07:19:58,400 --> 07:20:00,200
maybe like few hundred products.

10105
07:20:00,200 --> 07:20:02,600
We would have used
and rated the products

10106
07:20:02,600 --> 07:20:04,600
within Amazon for
the complete lifetime.

10107
07:20:04,600 --> 07:20:07,700
If I have to represent
all ratings of the products

10108
07:20:07,700 --> 07:20:10,194
with a vector, say
the first position

10109
07:20:10,194 --> 07:20:13,400
of the ratings is going
to refer to the product

10110
07:20:13,400 --> 07:20:15,200
with ID 1, and the second position

10111
07:20:15,200 --> 07:20:17,600
is going to refer
to the product with ID 2.

10112
07:20:17,600 --> 07:20:20,700
So I will have a million values
within that particular vector.

10113
07:20:20,700 --> 07:20:22,645
But out of a million values,

10114
07:20:22,645 --> 07:20:25,493
I'll have values only
for a hundred products where I

10115
07:20:25,493 --> 07:20:27,300
have provided the ratings.

10116
07:20:27,400 --> 07:20:30,947
So a rating may vary from
1 to 5; for all others

10117
07:20:30,947 --> 07:20:34,200
it will be 0. Sparse
means thinly distributed.

10118
07:20:34,800 --> 07:20:38,774
So rather than representing the huge amount
of data with the position

10119
07:20:38,774 --> 07:20:41,900
and saying this particular
position is having

10120
07:20:41,900 --> 07:20:43,800
a 0 value, we can mention

10121
07:20:43,800 --> 07:20:45,900
that with a key and value.

10122
07:20:45,900 --> 07:20:47,415
So, which position has

10123
07:20:47,415 --> 07:20:51,500
what value. Rather than storing
all the zeros, we can store only the

10124
07:20:51,500 --> 07:20:55,471
non-zeros: the position and
the corresponding value.

10125
07:20:55,471 --> 07:20:58,400
That means all others are going
to be a zero value

10126
07:20:58,400 --> 07:21:01,400
so we can mention
this particular sparse

10127
07:21:01,400 --> 07:21:05,400
vector, using it
to represent the nonzero entries.

10128
07:21:05,400 --> 07:21:08,300
So to store only
the nonzero entries,

10129
07:21:08,300 --> 07:21:10,364
this sparse vector will be used,

10130
07:21:10,364 --> 07:21:12,500
so that we don't need to waste

10131
07:21:12,500 --> 07:21:15,550
additional space while
storing this sparse vector.
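A minimal sketch of that position-to-value idea in plain Python (not Spark's own vector classes; the product positions, ratings, and helper names are made up for illustration). Spark MLlib offers Vectors.sparse for the same purpose.
```python
# Plain-Python sketch of a sparse vector (not Spark's own vector classes).
# Product positions and ratings are made-up sample data.
size = 1_000_000  # one position per product
ratings = {0: 5.0, 41: 3.0, 999_999: 4.0}  # position -> nonzero rating
def get(index):
    if not 0 <= index < size:
        raise IndexError(index)
    return ratings.get(index, 0.0)  # anything not stored is 0
def dot(a, b):
    # iterate over the smaller dict; zeros contribute nothing
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(i, 0.0) for i, v in a.items())
other = {41: 2.0, 7: 1.0}
stored = len(ratings)  # only three entries kept for a million positions
product = dot(ratings, other)
```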

10132
07:21:15,550 --> 07:21:18,600
Let's discuss some questions
on spark streaming.

10133
07:21:18,600 --> 07:21:21,422
How is streaming done
in Spark? Explain

10134
07:21:21,422 --> 07:21:23,900
with examples. Spark
Streaming is used

10135
07:21:23,900 --> 07:21:25,452
for processing real-time

10136
07:21:25,452 --> 07:21:29,500
streaming data. To say it precisely,
it's micro-batch processing.

10137
07:21:29,500 --> 07:21:32,852
So data will be collected
between every small interval say

10138
07:21:32,852 --> 07:21:35,128
maybe 0.5 seconds
or every second,

10139
07:21:35,128 --> 07:21:36,200
and then it gets processed.

10140
07:21:36,200 --> 07:21:36,900
So internally,

10141
07:21:36,900 --> 07:21:40,100
it's going to create
micro-batches. The data created

10142
07:21:40,100 --> 07:21:43,800
out of that micro-batch we call
a DStream. The DStream

10143
07:21:43,800 --> 07:21:45,500
is like an RDD,

10144
07:21:45,500 --> 07:21:48,200
so I can do
Transformations and actions.

10145
07:21:48,200 --> 07:21:50,691
Whatever I do
with an RDD I can do

10146
07:21:50,691 --> 07:21:52,200
with the DStream as well.

10147
07:21:52,500 --> 07:21:57,100
And Spark Streaming can read
data from Flume, HDFS, or

10148
07:21:57,100 --> 07:21:59,500
other streaming services as well,

10149
07:21:59,800 --> 07:22:02,565
and store the data
in the dashboard or in

10150
07:22:02,565 --> 07:22:06,300
any other database and it
provides very high throughput

10151
07:22:06,400 --> 07:22:09,200
as it can be processed with
a number of different systems

10152
07:22:09,200 --> 07:22:11,800
in a distributed
fashion. Again, the

10153
07:22:11,800 --> 07:22:14,858
DStream will be partitioned
internally and it has

10154
07:22:14,858 --> 07:22:17,100
the built-in feature
of fault tolerance,

10155
07:22:17,100 --> 07:22:18,700
even if any data is lost

10156
07:22:18,700 --> 07:22:22,100
and its transformed RDD
is lost, it can regenerate

10157
07:22:22,100 --> 07:22:23,930
those rdds from the existing

10158
07:22:23,930 --> 07:22:25,500
or from the source data.

10159
07:22:25,500 --> 07:22:28,100
So the DStream is going
to be the building block

10160
07:22:28,100 --> 07:22:32,748
of streaming and it has
the fault tolerance mechanism

10161
07:22:32,748 --> 07:22:34,902
that we have within the RDD.

10162
07:22:35,000 --> 07:22:38,600
So the DStream is a specialized
RDD, a specialized form

10163
07:22:38,600 --> 07:22:42,000
of RDD, specifically for use
within Spark Streaming.

10164
07:22:42,000 --> 07:22:42,253
Okay.

10165
07:22:42,253 --> 07:22:42,963
Next question.

10166
07:22:42,963 --> 07:22:45,600
What is the significance
of sliding window operation?

10167
07:22:45,600 --> 07:22:48,700
That's a very interesting one.
In streaming data, whenever

10168
07:22:48,700 --> 07:22:50,600
we do the computing, the data

10169
07:22:50,600 --> 07:22:53,218
density or the
business implications

10170
07:22:53,218 --> 07:22:56,500
of that specific data
may oscillate a lot.

10171
07:22:56,500 --> 07:22:58,400
For example, within Twitter

10172
07:22:58,400 --> 07:23:01,455
we have the trending
tweet hashtag, trending just

10173
07:23:01,455 --> 07:23:03,900
because that hashtag
is very popular.

10174
07:23:03,900 --> 07:23:06,200
Maybe someone might have hacked
into the system

10175
07:23:06,200 --> 07:23:09,500
and posted a number of tweets;
maybe for that particular

10176
07:23:09,500 --> 07:23:12,202
hour it might have appeared
millions of times. Just

10177
07:23:12,202 --> 07:23:15,123
because it appeared millions
of times for that specific

10178
07:23:15,123 --> 07:23:16,107
one-minute duration,

10179
07:23:16,107 --> 07:23:18,800
or, say, a two-to-three-minute
duration, it should not get

10180
07:23:18,800 --> 07:23:20,200
into the trending tag, the

10181
07:23:20,200 --> 07:23:22,286
trending hashtag for
that particular day

10182
07:23:22,286 --> 07:23:23,992
or for that particular month.

10183
07:23:23,992 --> 07:23:26,700
So what will we do? We
will try to do an average.

10184
07:23:26,700 --> 07:23:29,357
So we take a window:
this current time frame

10185
07:23:29,357 --> 07:23:32,500
and T minus 1, T minus 2; all
that data we will consider,

10186
07:23:32,500 --> 07:23:34,807
and we will try to find
the average or sum,

10187
07:23:34,807 --> 07:23:37,276
so the complete business logic
will be applied

10188
07:23:37,276 --> 07:23:39,100
against that particular window.

10189
07:23:39,200 --> 07:23:43,400
So any drastic changes
or, to say it precisely, a spike

10190
07:23:43,500 --> 07:23:46,200
or dip, a very
drastic spike or

10191
07:23:46,200 --> 07:23:50,300
drastic dip in the pattern
of the data will be normalized.

10192
07:23:50,300 --> 07:23:51,100
So that's the

10193
07:23:51,100 --> 07:23:54,452
significance of using
the sliding window operation

10194
07:23:54,452 --> 07:23:55,800
within Spark Streaming,

10195
07:23:55,800 --> 07:23:59,600
and Spark can handle this
sliding window automatically.

10196
07:23:59,600 --> 07:24:04,000
It can store the prior data
the T minus 1 T minus 2 and

10197
07:24:04,000 --> 07:24:06,300
how big the window
needs to be maintained

10198
07:24:06,300 --> 07:24:09,192
all that can be handled easily
within the program

10199
07:24:09,192 --> 07:24:11,100
and it's at the abstract level.
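A rough sketch of that windowed averaging in plain Python (not the Spark API; the per-batch counts and the helper windowed_averages are made up for illustration). In Spark Streaming the windowing itself is done by DStream operations such as window and reduceByKeyAndWindow.
```python
# Plain-Python sketch of a sliding-window average (not the Spark API).
# The per-batch hashtag counts are made-up sample data.
from collections import deque
def windowed_averages(batch_counts, window_size):
    window = deque(maxlen=window_size)  # keeps only the last N batches
    averages = []
    for count in batch_counts:
        window.append(count)  # the oldest batch drops out automatically
        averages.append(sum(window) / len(window))
    return averages
counts = [10, 12, 900, 11, 9]  # one sudden spike in the third batch
smoothed = windowed_averages(counts, window_size=3)
```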

10200
07:24:11,300 --> 07:24:12,100
Next question is

10201
07:24:12,100 --> 07:24:15,600
what is a DStream? The expansion
is discretized stream.

10202
07:24:15,600 --> 07:24:17,600
So that's the abstract form

10203
07:24:17,600 --> 07:24:20,500
or the virtual form
of representation of the data.

10204
07:24:20,500 --> 07:24:22,494
for Spark
Streaming. The same way

10205
07:24:22,494 --> 07:24:25,200
an RDD gets
transformed from one form

10206
07:24:25,200 --> 07:24:26,200
to another form,

10207
07:24:26,200 --> 07:24:27,504
we will have a series

10208
07:24:27,504 --> 07:24:30,800
of RDDs, all put together,
called a DStream.

10209
07:24:30,800 --> 07:24:32,100
So a DStream is nothing

10210
07:24:32,100 --> 07:24:34,000
but another representation

10211
07:24:34,000 --> 07:24:36,593
of an RDD, or rather
a group of RDDs.

10212
07:24:36,593 --> 07:24:38,223
Because it is a stream,

10213
07:24:38,223 --> 07:24:41,100
I can apply
the streaming functions

10214
07:24:41,100 --> 07:24:43,921
or any of the functions
transformations or actions

10215
07:24:43,921 --> 07:24:47,200
available within the streaming
against this DStream,

10216
07:24:47,300 --> 07:24:49,674
within that
particular micro-batch.

10217
07:24:49,674 --> 07:24:51,600
So I will define at what interval

10218
07:24:51,600 --> 07:24:54,377
the data should be collected
and should be processed,

10219
07:24:54,377 --> 07:24:56,100
because it is a micro-batch.

10220
07:24:56,100 --> 07:24:59,900
It could be every 1 second
or every hundred milliseconds

10221
07:24:59,900 --> 07:25:01,000
or every five seconds.

10222
07:25:01,300 --> 07:25:02,300
I can define that

10223
07:25:02,300 --> 07:25:04,300
particular period, so
all the data received

10224
07:25:04,300 --> 07:25:07,300
in that particular duration
will be considered

10225
07:25:07,300 --> 07:25:08,400
as a piece of data

10226
07:25:08,400 --> 07:25:09,600
and that will be called

10227
07:25:09,600 --> 07:25:13,400
a DStream. Next question: explain
caching in Spark Streaming.

10228
07:25:13,400 --> 07:25:14,000
Of course,

10229
07:25:14,000 --> 07:25:15,000
yes. Spark, internally,

10230
07:25:15,000 --> 07:25:16,300
uses in-memory computing.

10231
07:25:16,600 --> 07:25:18,700
So any data when it
is doing the Computing

10232
07:25:18,900 --> 07:25:21,600
that's getting generated
will be there in memory. But beyond

10233
07:25:21,600 --> 07:25:25,100
that if you do more and more
processing with other jobs

10234
07:25:25,100 --> 07:25:27,190
when there is a need
for more memory,

10235
07:25:27,190 --> 07:25:30,500
the least used RDDs will be
cleared off from the memory,

10236
07:25:30,500 --> 07:25:34,100
or the least used data
available out of actions

10237
07:25:34,100 --> 07:25:36,700
from the RDD will be cleared
off from the memory.

10238
07:25:36,700 --> 07:25:40,000
Sometimes I may need
that data forever in memory,

10239
07:25:40,000 --> 07:25:41,800
very simple example,
like dictionary.

10240
07:25:42,100 --> 07:25:43,600
I want the dictionary words

10241
07:25:43,600 --> 07:25:45,658
should be always
available in memory

10242
07:25:45,658 --> 07:25:48,900
because I may do a spell check
against the Tweet comments

10243
07:25:48,900 --> 07:25:51,500
or feedback comments
n number of times.

10244
07:25:51,500 --> 07:25:54,900
So what can I do? I
can say cache. Any data

10245
07:25:54,900 --> 07:25:57,036
that comes in, we can cache it

10246
07:25:57,036 --> 07:25:59,100
or persist it in memory.

10247
07:25:59,100 --> 07:26:02,100
So even when there is a need
for memory by other applications

10248
07:26:02,100 --> 07:26:05,800
this specific data will
not be removed, and especially

10249
07:26:05,800 --> 07:26:08,800
that will be used to do
the further processing

10250
07:26:08,800 --> 07:26:11,500
And the caching
also can be defined

10251
07:26:11,500 --> 07:26:15,200
whether it should be in memory
only or in memory and hard disk;

10252
07:26:15,200 --> 07:26:17,000
that also we can Define it.

10253
07:26:17,000 --> 07:26:20,100
Let's discuss some questions
on Spark GraphX.

10254
07:26:20,300 --> 07:26:24,000
The next question is: is there
an API for implementing graphs

10255
07:26:24,000 --> 07:26:26,200
in Spark? In graph theory,

10256
07:26:26,600 --> 07:26:28,100
everything will be represented

10257
07:26:28,100 --> 07:26:33,200
as a graph. As a graph, it
will have nodes and edges.

10258
07:26:33,419 --> 07:26:36,880
So all will be represented
using RDDs.

10259
07:26:37,000 --> 07:26:40,300
So it's going to extend
the RDD, and there is

10260
07:26:40,300 --> 07:26:42,482
a component called GraphX,

10261
07:26:42,500 --> 07:26:44,983
and it exposes
the functionalities

10262
07:26:44,983 --> 07:26:49,800
to represent a graph. We can have
an edge RDD and a vertex RDD. By

10263
07:26:49,800 --> 07:26:51,700
creating the edges and vertices,

10264
07:26:51,700 --> 07:26:53,239
I can create a graph

10265
07:26:53,500 --> 07:26:57,400
and this graph can exist
in a distributed environment.

10266
07:26:57,400 --> 07:27:00,208
So same way we will be
in a position to do

10267
07:27:00,208 --> 07:27:02,400
the parallel processing as well.

10268
07:27:02,700 --> 07:27:06,300
So GraphX is just
a form of representing

10269
07:27:06,400 --> 07:27:11,200
the data as graphs, with edges
and vertices. And of course,

10270
07:27:11,200 --> 07:27:14,299
yes, it provides the API
to implement or create

10271
07:27:14,299 --> 07:27:17,400
the graph and do the processing
on the graph; the APIs

10272
07:27:17,400 --> 07:27:19,900
are provided. What is PageRank

10273
07:27:20,100 --> 07:27:24,600
in GraphX?
Once the graph is created,

10274
07:27:24,600 --> 07:27:28,900
we can calculate the PageRank
for a particular node.

10275
07:27:29,100 --> 07:27:32,000
So that's very similar to
how we have the page rank

10276
07:27:32,100 --> 07:27:35,635
for the websites within Google.
The higher the PageRank,

10277
07:27:35,635 --> 07:27:38,774
the more important the node is
within that particular graph.

10278
07:27:38,774 --> 07:27:40,547
It's going to
show the importance

10279
07:27:40,547 --> 07:27:41,900
of that particular node

10280
07:27:41,900 --> 07:27:45,154
or edge within that particular
graph. A graph is

10281
07:27:45,154 --> 07:27:46,700
a connected set of data.

10282
07:27:46,800 --> 07:27:49,600
All the data will be connected
using a property,

10283
07:27:49,600 --> 07:27:51,100
and for how important

10284
07:27:51,100 --> 07:27:55,300
that property is, we will have
a value associated with it.

10285
07:27:55,500 --> 07:27:57,900
So within pagerank
we can calculate

10286
07:27:57,900 --> 07:27:59,100
a static PageRank:

10287
07:27:59,300 --> 07:28:00,703
it will run a number

10288
07:28:00,703 --> 07:28:03,300
of iterations. Or there
is another PageRank

10289
07:28:03,300 --> 07:28:06,600
called dynamic PageRank
that will get executed

10290
07:28:06,600 --> 07:28:09,200
till we reach
a particular saturation level

10291
07:28:09,300 --> 07:28:13,600
and the saturation level can be
defined with multiple criteria.

10292
07:28:14,100 --> 07:28:15,200
And the APIs,

10293
07:28:15,200 --> 07:28:17,500
because these are
graph operations,

10294
07:28:17,700 --> 07:28:20,289
can be directly executed
against those graphs,

10295
07:28:20,289 --> 07:28:23,700
and they are all available
as APIs within GraphX.

10296
07:28:24,103 --> 07:28:25,796
What is a lineage graph?

10297
07:28:26,000 --> 07:28:28,400
So the RDDs are very similar

10298
07:28:28,500 --> 07:28:32,800
to GraphX, to the
graph representation. Every RDD,

10299
07:28:32,800 --> 07:28:33,800
internally,

10300
07:28:33,800 --> 07:28:36,400
will have the relation saying

10301
07:28:36,500 --> 07:28:39,157
how that particular
RDD got created,

10302
07:28:39,157 --> 07:28:42,725
and from where, and how
it got transformed,

10303
07:28:42,725 --> 07:28:44,700
how that got transformed.

10304
07:28:44,700 --> 07:28:47,600
So the complete lineage
or the complete history

10305
07:28:47,600 --> 07:28:50,587
or the complete path
will be recorded

10306
07:28:50,587 --> 07:28:51,900
within the lineage.

10307
07:28:52,100 --> 07:28:53,517
That will be used in case

10308
07:28:53,517 --> 07:28:56,400
if any particular partition
of the target is lost.

10309
07:28:56,400 --> 07:28:57,900
It can be regenerated.

10310
07:28:58,000 --> 07:28:59,899
Even if the complete
artery is lost.

10311
07:28:59,899 --> 07:29:00,900
We can regenerate

10312
07:29:00,900 --> 07:29:03,149
so it will have the complete
information on what are

10313
07:29:03,149 --> 07:29:06,193
the partitions where it is
existing water Transformations.

10314
07:29:06,193 --> 07:29:07,119
It had undergone.

10315
07:29:07,119 --> 07:29:08,747
What is the resultant and you

10316
07:29:08,747 --> 07:29:10,600
if anything is lost
in the middle,

10317
07:29:10,600 --> 07:29:12,511
it knows where to recalculate

10318
07:29:12,511 --> 07:29:16,400
from and which essential
things need to be recalculated.

10319
07:29:16,400 --> 07:29:19,817
It's going to save us a lot
of time, and if that RDD

10320
07:29:19,817 --> 07:29:21,762
is never used, it will

10321
07:29:21,762 --> 07:29:23,100
never get recalculated.

10322
07:29:23,100 --> 07:29:26,500
So the recalculation also
triggers based on the action

10323
07:29:26,500 --> 07:29:27,799
only on a need basis.

10324
07:29:27,799 --> 07:29:29,100
It will recalculate

10325
07:29:29,200 --> 07:29:32,500
that's why it's going
to use the memory optimally

10326
07:29:32,700 --> 07:29:36,087
Does Apache Spark provide
checkpointing officially?

10327
07:29:36,087 --> 07:29:38,300
Take an example
like streaming.

10328
07:29:38,600 --> 07:29:43,600
If any data is lost within
that particular sliding window,

10329
07:29:43,600 --> 07:29:47,492
we cannot get back the data, or
rather, the data will be lost,

10330
07:29:47,492 --> 07:29:50,103
because, say, I'm making
a window of, say, 24

10331
07:29:50,103 --> 07:29:51,800
hours to do some averaging.

10332
07:29:51,800 --> 07:29:55,270
I'm making a sliding window
of 24 hours; every 24 hours

10333
07:29:55,270 --> 07:29:59,100
it will keep on sliding,
and if you lose any system,

10334
07:29:59,100 --> 07:30:01,500
as in there is a complete
failure of the cluster.

10335
07:30:01,500 --> 07:30:02,562
I may lose the data

10336
07:30:02,562 --> 07:30:04,800
because it's all available
in the memory.

10337
07:30:04,900 --> 07:30:06,400
So how to recalculate

10338
07:30:06,400 --> 07:30:08,902
if the data is lost?
It follows something

10339
07:30:08,902 --> 07:30:10,100
called checkpointing,

10340
07:30:10,100 --> 07:30:12,831
so we can checkpoint
the data directly.

10341
07:30:12,831 --> 07:30:14,800
It's provided by the Spark API.

10342
07:30:14,800 --> 07:30:16,600
We have to just
provide the location

10343
07:30:16,600 --> 07:30:19,700
where it should get
checkpointed, and you can read

10344
07:30:19,700 --> 07:30:23,200
that particular data back
when you start the system again;

10345
07:30:23,200 --> 07:30:24,866
whatever the state it was

10346
07:30:24,866 --> 07:30:27,600
in, we can regenerate
that particular data.

10347
07:30:27,700 --> 07:30:29,454
So yes to answer the question

10348
07:30:29,454 --> 07:30:32,300
straight: Spark
provides checkpointing,

10349
07:30:32,300 --> 07:30:35,300
and it will help us
to regenerate the state

10350
07:30:35,300 --> 07:30:37,010
what it was earlier.

10351
07:30:37,200 --> 07:30:40,000
Let's move on to the next
component, Spark MLlib.

10352
07:30:40,300 --> 07:30:41,515
How is machine learning

10353
07:30:41,515 --> 07:30:44,600
implemented in Spark?
Machine learning, again:

10354
07:30:44,600 --> 07:30:46,800
It's a very huge ocean by itself

10355
07:30:46,900 --> 07:30:49,800
and it's not a technology
specific to spark

10356
07:30:49,800 --> 07:30:51,800
Machine learning is
common to data science.

10357
07:30:51,800 --> 07:30:55,235
It's a subset of data science work
where we have different types

10358
07:30:55,235 --> 07:30:57,983
of algorithms, different
categories of algorithms,

10359
07:30:57,983 --> 07:31:01,100
like clustering, regression,
dimensionality reduction,

10360
07:31:01,100 --> 07:31:02,100
all that we have,

10361
07:31:02,300 --> 07:31:05,600
and all these algorithms,
or most of the algorithms,

10362
07:31:05,600 --> 07:31:08,070
have been implemented
in Spark, and Spark is

10363
07:31:08,070 --> 07:31:09,481
the preferred framework

10364
07:31:09,481 --> 07:31:12,910
or the preferred application
component to run machine

10365
07:31:12,910 --> 07:31:14,500
learning algorithms nowadays,

10366
07:31:14,500 --> 07:31:16,500
or machine learning
processing. The reason is

10367
07:31:16,500 --> 07:31:19,700
because most of the machine
learning algorithms need

10368
07:31:19,700 --> 07:31:21,890
to be executed iteratively, a number

10369
07:31:21,890 --> 07:31:25,000
of times, till we get
the optimal result, maybe

10370
07:31:25,000 --> 07:31:27,700
like, say, twenty-five
iterations or 58 iterations,

10371
07:31:27,700 --> 07:31:29,900
or till we get
that specific accuracy.

10372
07:31:29,900 --> 07:31:33,100
You will keep on running
the processing again and again

10373
07:31:33,100 --> 07:31:36,092
and Spark is a very good fit
whenever you want to do

10374
07:31:36,092 --> 07:31:37,900
the processing again and again

10375
07:31:37,900 --> 07:31:40,400
because the data
will be available in memory.

10376
07:31:40,400 --> 07:31:43,600
I can read it faster, store
the data back into the memory

10377
07:31:43,600 --> 07:31:44,700
again, and read it faster,

10378
07:31:44,700 --> 07:31:47,500
and all these machine learning
algorithms have been provided

10379
07:31:47,500 --> 07:31:50,800
within Spark as a separate
component called MLlib,

10380
07:31:50,900 --> 07:31:53,096
and within MLlib we
have other components

10381
07:31:53,096 --> 07:31:55,800
like featurization,
to extract the features.

10382
07:31:55,800 --> 07:31:58,575
You may be wondering
how they can process

10383
07:31:58,575 --> 07:32:02,600
images. The core thing about
processing an image or audio

10384
07:32:02,600 --> 07:32:04,922
or video is about
extracting the feature

10385
07:32:04,922 --> 07:32:08,363
and comparing the features,
how much they are related.

10386
07:32:08,363 --> 07:32:10,300
So that's where
vectors, matrices, all

10387
07:32:10,300 --> 07:32:13,500
that will come into the picture,
and we can have a pipeline

10388
07:32:13,500 --> 07:32:16,144
of processing as well:
do processing step

10389
07:32:16,144 --> 07:32:18,800
one, then take the result
and do processing step

10390
07:32:18,800 --> 07:32:21,700
two. And it has persistence
support as well.

10391
07:32:21,700 --> 07:32:24,234
The result of it,
the generated model,

10392
07:32:24,234 --> 07:32:25,999
the result, can be persisted

10393
07:32:25,999 --> 07:32:27,010
and reloaded back

10394
07:32:27,010 --> 07:32:29,421
into the system to
continue the processing

10395
07:32:29,421 --> 07:32:32,245
from that particular point
onwards. Next question:

10396
07:32:32,245 --> 07:32:34,605
What are the categories
of machine learning? Machine

10397
07:32:34,605 --> 07:32:38,000
learning as such has different
categories available: supervised,

10398
07:32:38,000 --> 07:32:41,001
unsupervised, and
reinforcement learning. Supervised

10399
07:32:41,001 --> 07:32:42,900
and unsupervised are very popular,

10400
07:32:43,200 --> 07:32:46,700
where we will know some
I'll give an example.

10401
07:32:47,200 --> 07:32:50,123
I'll know well
in advance what category

10402
07:32:50,123 --> 07:32:54,800
that belongs to. Say I want
to do character recognition:

10403
07:32:55,400 --> 07:32:57,185
while training the data,

10404
07:32:57,185 --> 07:33:01,800
I can give information saying
this particular image belongs

10405
07:33:01,800 --> 07:33:04,160
to this particular
category, character,

10406
07:33:04,160 --> 07:33:05,800
or this particular number

10407
07:33:05,800 --> 07:33:10,100
and I can train. Sometimes I
will not know well in advance.

10408
07:33:10,100 --> 07:33:14,478
Assume I may have
different types of images;

10409
07:33:14,700 --> 07:33:19,200
it may have
cars, bikes, cats, dogs, all that.

10410
07:33:19,400 --> 07:33:21,920
I want to know
how many categories are available.

10411
07:33:21,920 --> 07:33:25,279
No, I will not know well
in advance, so I want to group them,

10412
07:33:25,279 --> 07:33:26,900
see how many categories are available,

10413
07:33:26,900 --> 07:33:29,100
and then I'll
realize saying okay,

10414
07:33:29,100 --> 07:33:31,600
all of these belong
to a particular category.

10415
07:33:31,600 --> 07:33:33,800
I'll identify the pattern
within the category

10416
07:33:33,800 --> 07:33:36,333
and I'll give
a category name, say,

10417
07:33:36,333 --> 07:33:39,751
like all these images
belong to the boat category

10418
07:33:39,751 --> 07:33:41,300
or look like a boat.

10419
07:33:41,500 --> 07:33:45,400
So whether we leave it to the system
or provide this value ourselves,

10420
07:33:45,400 --> 07:33:48,400
that is where the different
types of machine learning come

10421
07:33:48,400 --> 07:33:49,503
into the picture, and

10422
07:33:49,503 --> 07:33:53,160
as such machine learning is
not specific to Spark. Spark is going

10423
07:33:53,160 --> 07:33:57,300
to help us run
these machine learning algorithms.

10424
07:33:57,400 --> 07:34:00,700
What are the Spark MLlib
tools? MLlib is nothing

10425
07:34:00,700 --> 07:34:02,300
but the machine learning library,

10426
07:34:02,300 --> 07:34:03,700
or the machine learning offering,

10427
07:34:03,700 --> 07:34:07,200
within Spark, and it has a
number of algorithms implemented

10428
07:34:07,200 --> 07:34:09,800
and it provides a very
good feature to persist

10429
07:34:09,800 --> 07:34:12,306
the result. Generally,
in machine learning,

10430
07:34:12,306 --> 07:34:14,509
we will generate
a model; the pattern

10431
07:34:14,509 --> 07:34:17,089
of the data recorded
is a model. The model

10432
07:34:17,089 --> 07:34:20,688
will be persisted in one of
different forms, like Parquet.

10433
07:34:20,688 --> 07:34:23,087
Through these
different forms,

10434
07:34:23,087 --> 07:34:26,700
it can be stored or persisted,
and it has methodologies

10435
07:34:26,700 --> 07:34:29,600
to extract the features
from a set of data.

10436
07:34:29,600 --> 07:34:31,353
I may have a million images.

10437
07:34:31,353 --> 07:34:32,500
I want to extract

10438
07:34:32,500 --> 07:34:36,300
the common features available
within those millions of images

10439
07:34:36,300 --> 07:34:40,170
and other utilities are
available for processing, to define,

10440
07:34:40,170 --> 07:34:43,607
for example, the seed
for randomization. So

10441
07:34:43,607 --> 07:34:47,441
different utilities are
available as well as pipelines.

10442
07:34:47,441 --> 07:34:49,500
That's very specific to Spark,

10443
07:34:49,800 --> 07:34:53,300
where I can channel or
arrange the sequence

10444
07:34:53,300 --> 07:34:56,700
of steps to be undergone by
the machine learning job: machine

10445
07:34:56,700 --> 07:34:58,100
learning algorithm one first,

10446
07:34:58,100 --> 07:34:59,863
and then the result
of it will be fed

10447
07:34:59,863 --> 07:35:02,163
into machine learning
algorithm two. Like that,

10448
07:35:02,163 --> 07:35:03,400
we can have a sequence

10449
07:35:03,400 --> 07:35:06,500
of execution and
that will be defined using

10450
07:35:06,500 --> 07:35:10,562
pipelines. These are the notable
features of Spark MLlib.

10451
07:35:11,000 --> 07:35:15,100
What are some popular algorithms
and utilities in Spark MLlib?

10452
07:35:15,500 --> 07:35:18,382
So these are some popular
algorithms: regression,

10453
07:35:18,382 --> 07:35:22,000
classification, basic statistics,
recommendation systems.

10454
07:35:22,000 --> 07:35:24,678
The recommendation system is
well implemented.

10455
07:35:24,678 --> 07:35:27,000
All we have to provide
is give the data.

10456
07:35:27,000 --> 07:35:30,579
If you give the ratings and
products within an organization,

10457
07:35:30,579 --> 07:35:32,400
if you have the complete dump,

10458
07:35:32,400 --> 07:35:35,800
we can build the recommendation
system in no time.

10459
07:35:35,800 --> 07:35:39,283
And if you give it any user, it
can give a recommendation:

10460
07:35:39,283 --> 07:35:41,600
these are the products
the user may like,

10461
07:35:41,600 --> 07:35:42,500
and those products

10462
07:35:42,500 --> 07:35:45,900
can be displayed in the search
results. A recommendation system

10463
07:35:45,900 --> 07:35:48,017
really works on the basis
of the feedback

10464
07:35:48,017 --> 07:35:50,400
that we are providing
for the earlier products

10465
07:35:50,400 --> 07:35:51,500
that we had bought.

10466
07:35:51,600 --> 07:35:54,225
Clustering, dimensionality
reduction: whenever

10467
07:35:54,225 --> 07:35:57,300
we do training
with a huge amount of data,

10468
07:35:57,600 --> 07:35:59,511
it's very very compute-intensive

10469
07:35:59,511 --> 07:36:01,900
and we may have
to reduce the dimensions,

10470
07:36:01,900 --> 07:36:03,752
especially the matrix dimensions

10471
07:36:03,752 --> 07:36:07,000
within MLlib,
without losing the features.

10472
07:36:07,000 --> 07:36:09,538
Whatever features are
available, without losing them

10473
07:36:09,538 --> 07:36:11,308
we should reduce
the dimensionality,

10474
07:36:11,308 --> 07:36:13,580
and there are
some algorithms available to do

10475
07:36:13,580 --> 07:36:16,660
that dimensionality reduction
and feature extraction.

10476
07:36:16,660 --> 07:36:19,486
So what are the common features,
or the features available

10477
07:36:19,486 --> 07:36:22,227
within that particular image?
I can compare

10478
07:36:22,227 --> 07:36:23,300
what are the common

10479
07:36:23,300 --> 07:36:26,600
across common features
available within those images?

10480
07:36:26,600 --> 07:36:29,106
That's how we
will group those images.

10481
07:36:29,106 --> 07:36:29,716
So get me

10482
07:36:29,716 --> 07:36:32,900
whether this particular image
the person looking

10483
07:36:32,900 --> 07:36:35,300
like this image, is available
in the database or not.

10484
07:36:35,700 --> 07:36:37,524
For example,
assume the organization

10485
07:36:37,524 --> 07:36:40,600
or the police department or crime
department maintains a list

10486
07:36:40,600 --> 07:36:44,400
of persons who committed crimes,
and if we get a new photo,

10487
07:36:44,400 --> 07:36:48,161
when they do a search they
may not have the exact photo, bit

10488
07:36:48,161 --> 07:36:49,200
by bit. The photo

10489
07:36:49,200 --> 07:36:51,600
might have been taken
with a different background.

10490
07:36:51,600 --> 07:36:55,000
different lighting, different
locations, a different time.

10491
07:36:55,000 --> 07:36:57,754
So a hundred percent the data
will be different; the bits

10492
07:36:57,754 --> 07:37:00,520
and bytes will be different,
but look-wise,

10493
07:37:00,520 --> 07:37:03,767
yes, they are going to be the same.
So I'm going to search for

10494
07:37:03,767 --> 07:37:05,100
the photo looking similar

10495
07:37:05,100 --> 07:37:07,500
to this particular
photograph, which is the input

10496
07:37:07,500 --> 07:37:09,033
I'll provide. To achieve

10497
07:37:09,033 --> 07:37:11,976
that, we will be extracting
the features in each

10498
07:37:11,976 --> 07:37:13,000
of those photos.

10499
07:37:13,000 --> 07:37:15,717
We will extract the features
and we will try to match

10500
07:37:15,717 --> 07:37:17,697
the features rather than the bits

10501
07:37:17,697 --> 07:37:21,015
and bytes. And optimization as
well: in terms of processing

10502
07:37:21,015 --> 07:37:22,200
or doing the pipelining,

10503
07:37:22,200 --> 07:37:25,100
there are a number of algorithms
to do the optimization.

10504
07:37:25,400 --> 07:37:27,000
Let's move on to Spark SQL.

10505
07:37:27,100 --> 07:37:29,811
Is there a module
to implement SQL in Spark?

10506
07:37:29,811 --> 07:37:32,475
How does it work? So,
it is not directly SQL;

10507
07:37:32,475 --> 07:37:36,300
it may be very similar to Hive.
Whatever structured data

10508
07:37:36,300 --> 07:37:37,300
that we have.

10509
07:37:37,400 --> 07:37:38,800
We can read the data

10510
07:37:38,800 --> 07:37:42,000
or extract the meaning
out of the data using SQL

10511
07:37:42,400 --> 07:37:44,600
and it exposes APIs,

10512
07:37:44,700 --> 07:37:48,700
and we can use those APIs to read
the data or create data frames

10513
07:37:48,834 --> 07:37:51,065
and Spark SQL has four major

10514
07:37:51,500 --> 07:37:55,800
libraries: data source,
data frame... A data frame is

10515
07:37:55,800 --> 07:37:58,900
like the representation
of X and Y data

10516
07:37:59,300 --> 07:38:02,800
or like Excel data,
multi-dimensional structured data

10517
07:38:03,000 --> 07:38:06,000
in an abstract form.
On top of a data frame,

10518
07:38:06,000 --> 07:38:08,541
I can run a
query, and internally

10519
07:38:08,541 --> 07:38:11,700
it has an interpreter
and optimizer. Any query

10520
07:38:11,700 --> 07:38:15,100
I fire will
get interpreted and optimized

10521
07:38:15,100 --> 07:38:18,500
and get executed using
the SQL service, and get

10522
07:38:18,500 --> 07:38:20,300
the data from the data frame

10523
07:38:20,300 --> 07:38:22,900
or it can read the data
from the data source

10524
07:38:22,900 --> 07:38:24,000
and do the processing.

10525
07:38:24,265 --> 07:38:26,034
What is a Parquet file?

10526
07:38:26,100 --> 07:38:27,800
It's a format of the file

10527
07:38:27,800 --> 07:38:30,361
where the data
in some structured form,

10528
07:38:30,361 --> 07:38:33,800
especially the result
of Spark SQL, can be stored

10529
07:38:33,800 --> 07:38:37,350
or written to some persistent store.
And Parquet, again,

10530
07:38:37,350 --> 07:38:41,317
is open source from Apache;
it's a data serialization technique

10531
07:38:41,317 --> 07:38:44,833
where we can serialize the data
using the Parquet format,

10532
07:38:44,833 --> 07:38:46,078
and to precisely say,

10533
07:38:46,078 --> 07:38:47,500
it's a columnar storage.

10534
07:38:47,500 --> 07:38:49,900
It's going to consume
less space. It will use

10535
07:38:49,900 --> 07:38:51,200
keys and values to

10536
07:38:51,300 --> 07:38:55,500
store the data, and it also helps
you to access specific data

10537
07:38:55,500 --> 07:38:59,100
from that Parquet file
using a query. So Parquet:

10538
07:38:59,100 --> 07:39:02,200
it's another open-source
data serialization format

10539
07:39:02,200 --> 07:39:03,267
to store the data

10540
07:39:03,267 --> 07:39:04,900
and persist the data, as well

10541
07:39:04,900 --> 07:39:08,700
as to retrieve the data. List
the functions of Spark SQL.

10542
07:39:08,700 --> 07:39:10,800
It can be used
to load varieties

10543
07:39:10,800 --> 07:39:12,300
of structured data, of course,

10544
07:39:12,300 --> 07:39:15,600
yes, Spark SQL can work only
with structured data.

10545
07:39:15,600 --> 07:39:17,900
It can be used to load varieties

10546
07:39:17,900 --> 07:39:20,900
of structured data
and you can use SQL

10547
07:39:20,900 --> 07:39:23,600
like statements to query
inside the program,

10548
07:39:23,600 --> 07:39:25,000
and it can be used

10549
07:39:25,000 --> 07:39:27,839
with external tools that connect
to Spark as well.

10550
07:39:27,839 --> 07:39:30,400
It gives very good
integration between SQL

10551
07:39:30,400 --> 07:39:32,900
and Python,
Java, or Scala code.

10552
07:39:33,000 --> 07:39:35,831
We can create an RDD
from the structured data

10553
07:39:35,831 --> 07:39:38,400
available, directly using
Spark SQL.

10554
07:39:38,400 --> 07:39:40,300
I can generate the RDD.

10555
07:39:40,500 --> 07:39:42,600
So it's going to
facilitate the people

10556
07:39:42,600 --> 07:39:46,400
from database background to make
the program faster and quicker.

10557
07:39:47,100 --> 07:39:48,100
Next question is

10558
07:39:48,100 --> 07:39:50,700
what do you understand
by lazy evaluation?

10559
07:39:50,900 --> 07:39:54,400
So whenever you do any operation
within the Spark world,

10560
07:39:54,400 --> 07:39:57,281
it will not do the processing
immediately. It looks

10561
07:39:57,281 --> 07:40:00,100
for the final result
that we are asking for.

10562
07:40:00,100 --> 07:40:02,000
If we don't ask
for the final result,

10563
07:40:02,000 --> 07:40:04,660
it doesn't need to do
the processing. So it is based

10564
07:40:04,660 --> 07:40:07,200
on the final action:
until we do the action,

10565
07:40:07,200 --> 07:40:08,990
there will not be
any transformations,

10566
07:40:08,990 --> 07:40:11,700
that is, there will not be
any actual processing happening.

10567
07:40:11,700 --> 07:40:13,141
It will just understand

10568
07:40:13,141 --> 07:40:15,900
what transformations
it has to do. Finally,

10569
07:40:15,900 --> 07:40:18,900
if you ask for the action,
then in an optimized way

10570
07:40:18,900 --> 07:40:22,200
it's going to complete
the data processing and get

10571
07:40:22,200 --> 07:40:23,553
us the final result.

10572
07:40:23,553 --> 07:40:26,600
So to answer straight:
lazy evaluation is doing

10573
07:40:26,600 --> 07:40:30,300
the processing only on need
of the resultant data.

10574
07:40:30,300 --> 07:40:32,100
If the data is not required,

10575
07:40:32,100 --> 07:40:34,757
It's not going
to do the processing.

10576
07:40:34,757 --> 07:40:36,726
Can you use Spark to access

10577
07:40:36,726 --> 07:40:40,200
and analyze data stored
in Cassandra databases?

10578
07:40:40,200 --> 07:40:41,600
Yes, it is possible.

10579
07:40:41,600 --> 07:40:44,400
Okay, not only Cassandra:
with any NoSQL database it

10580
07:40:44,400 --> 07:40:46,100
can very well do the processing

10581
07:40:46,100 --> 07:40:49,700
and Cassandra also works
in a distributed architecture.

10582
07:40:49,700 --> 07:40:51,200
It's a NoSQL database,

10583
07:40:51,200 --> 07:40:53,800
so it can leverage
the data locality.

10584
07:40:53,800 --> 07:40:56,000
The query can
be executed locally

10585
07:40:56,000 --> 07:40:58,200
where the Cassandra
nodes are available.

10586
07:40:58,200 --> 07:41:01,100
It's going to make
the query execution faster

10587
07:41:01,100 --> 07:41:04,326
and reduce the network load.
As for the Spark executors,

10588
07:41:04,326 --> 07:41:06,009
it will try to start them,

10589
07:41:06,009 --> 07:41:08,242
the Spark executors,
on the machines

10590
07:41:08,242 --> 07:41:10,600
where the Cassandra
nodes are available

10591
07:41:10,600 --> 07:41:13,900
or data is available, and it's going
to do the processing locally.

10592
07:41:13,900 --> 07:41:16,450
So it's going to leverage
the data locality.

10593
07:41:16,450 --> 07:41:17,426
Next question:

10594
07:41:17,426 --> 07:41:19,500
how can you
minimize data transfers

10595
07:41:19,500 --> 07:41:21,200
when working with Spark?

10596
07:41:21,200 --> 07:41:23,636
If you ask about the core
design, the success

10597
07:41:23,636 --> 07:41:25,514
of the Spark program depends on

10598
07:41:25,514 --> 07:41:28,300
how much you are reducing
the network transfer.

10599
07:41:28,300 --> 07:41:30,900
This network transfer
is a very costly operation,

10600
07:41:30,900 --> 07:41:32,300
and you cannot parallelize it.

10601
07:41:32,400 --> 07:41:35,600
There are multiple ways,
especially two ways, to avoid

10602
07:41:35,600 --> 07:41:37,664
this: one is called
broadcast variables

10603
07:41:37,664 --> 07:41:40,300
and the other accumulators.
Broadcast variables:

10604
07:41:40,300 --> 07:41:43,536
these will help us
to transfer any static data

10605
07:41:43,536 --> 07:41:46,428
or any information, and
keep publishing it

10606
07:41:46,500 --> 07:41:48,300
to multiple systems.

10607
07:41:48,300 --> 07:41:49,300
So I'll see

10608
07:41:49,300 --> 07:41:52,257
if any data is to be transferred
to multiple executors

10609
07:41:52,257 --> 07:41:53,500
to be used in common.

10610
07:41:53,500 --> 07:41:55,016
I can broadcast it

10611
07:41:55,200 --> 07:41:58,800
and I might want to consolidate
the values happening

10612
07:41:58,800 --> 07:42:02,172
in multiple workers in
a single centralized location.

10613
07:42:02,172 --> 07:42:03,600
I can use an accumulator.

10614
07:42:03,600 --> 07:42:06,412
So this will help us to achieve
the data consolidation

10615
07:42:06,412 --> 07:42:08,800
or data distribution
in the distributed world.

10616
07:42:08,800 --> 07:42:11,800
The API is at an abstract level,

10617
07:42:11,800 --> 07:42:14,351
where we don't need
to do the heavy lifting

10618
07:42:14,351 --> 07:42:16,600
that's taken care of
by Spark for us.

10619
07:42:16,800 --> 07:42:19,275
What are broadcast
variables? Just now,

10620
07:42:19,275 --> 07:42:22,300
as we discussed, the value,
or the common value

10621
07:42:22,300 --> 07:42:23,200
that we need:

10622
07:42:23,200 --> 07:42:27,300
I may want that to be available
in multiple executors

10623
07:42:27,300 --> 07:42:31,000
multiple workers. Simple example:
you want to do a spell check

10624
07:42:31,000 --> 07:42:33,500
on tweet
comments. The dictionary,

10625
07:42:33,500 --> 07:42:36,100
which has the right
list of words:

10626
07:42:36,200 --> 07:42:37,800
I'll have the complete list.

10627
07:42:37,800 --> 07:42:40,300
I want that particular
dictionary to be available

10628
07:42:40,300 --> 07:42:41,400
in each executor

10629
07:42:41,400 --> 07:42:43,944
so that the tasks
that are running locally

10630
07:42:43,944 --> 07:42:46,600
in those executors can refer
to that particular

10631
07:42:46,600 --> 07:42:49,900
dictionary and get the processing
done by avoiding

10632
07:42:49,900 --> 07:42:51,616
the network data transfer.

10633
07:42:51,616 --> 07:42:55,485
So the process of distributing
the data from the Spark context

10634
07:42:55,485 --> 07:42:56,500
to the executors

10635
07:42:56,500 --> 07:42:58,700
where the tasks are going
to run is achieved

10636
07:42:58,700 --> 07:43:00,400
using broadcast variables,

10637
07:43:00,400 --> 07:43:03,952
and it's built into the
Spark API. Using the Spark API,

10638
07:43:03,952 --> 07:43:06,000
we can create
the broadcast variable,

10639
07:43:06,200 --> 07:43:09,500
and the process of distributing
this data, making it available

10640
07:43:09,500 --> 07:43:13,524
in all executors, is taken care of
by the Spark framework. Explain

10641
07:43:13,524 --> 07:43:15,000
accumulators in Spark.

10642
07:43:15,100 --> 07:43:18,500
In a similar way to how we
have broadcast variables.

10643
07:43:18,500 --> 07:43:21,290
we have accumulators
as well. Simple example:

10644
07:43:21,290 --> 07:43:25,100
you want to count how many
error codes are available

10645
07:43:25,100 --> 07:43:26,600
in the distributed environment

10646
07:43:26,800 --> 07:43:28,400
as your data is distributed

10647
07:43:28,400 --> 07:43:31,300
across multiple systems,
multiple executors.

10648
07:43:31,400 --> 07:43:34,784
Each executor will do
the processing and count

10649
07:43:34,784 --> 07:43:37,200
the records atomically.

10650
07:43:37,200 --> 07:43:38,978
I may want the total count.

10651
07:43:38,978 --> 07:43:42,600
So what I will do I will ask
to maintain an accumulator,

10652
07:43:42,600 --> 07:43:45,250
of course, it will be maintained
in the Spark context,

10653
07:43:45,250 --> 07:43:48,500
in the driver program.
The driver program is going

10654
07:43:48,500 --> 07:43:50,100
to be one per application.

10655
07:43:50,100 --> 07:43:52,200
It will keep on
getting accumulated

10656
07:43:52,200 --> 07:43:54,900
and whenever I want I
can read those values

10657
07:43:54,900 --> 07:43:57,100
and take any appropriate action.

10658
07:43:57,200 --> 07:44:00,300
So, more or less,
accumulators and broadcast variables

10659
07:44:00,300 --> 07:44:01,600
look like opposites of each other,

10660
07:44:02,000 --> 07:44:03,800
but the purpose
is totally different.

10661
07:44:04,200 --> 07:44:06,531
Why is there a need
for broadcast variables

10662
07:44:06,531 --> 07:44:10,400
when working with Apache Spark?
It's a read-only variable,

10663
07:44:10,400 --> 07:44:13,800
and it will be cached in memory
in a distributed fashion

10664
07:44:13,800 --> 07:44:15,789
and it eliminates the work

10665
07:44:15,789 --> 07:44:19,012
of moving the data
from a centralized location

10666
07:44:19,012 --> 07:44:20,400
that is, the Spark driver,

10667
07:44:20,400 --> 07:44:24,200
or from a particular program
to all the executors

10668
07:44:24,200 --> 07:44:26,830
within the cluster where
the tasks are going to get executed.

10669
07:44:26,830 --> 07:44:29,700
We don't need to worry about
where the task will get executed

10670
07:44:29,700 --> 07:44:31,100
within the cluster.

10671
07:44:31,100 --> 07:44:32,138
So when compared

10672
07:44:32,138 --> 07:44:34,900
with accumulators,
for broadcast variables

10673
07:44:34,900 --> 07:44:37,256
it's going to have
a read-only operation.

10674
07:44:37,256 --> 07:44:38,903
The executors cannot change

10675
07:44:38,903 --> 07:44:41,100
the value; they can only
read those values.

10676
07:44:41,100 --> 07:44:44,900
They cannot update it, so mostly
it will be used like a cache

10677
07:44:44,900 --> 07:44:47,400
that we have for the
RDD. Next question:

10678
07:44:47,400 --> 07:44:50,327
how can you trigger
automatic cleanups in Spark

10679
07:44:50,327 --> 07:44:52,300
to handle accumulated metadata?

10680
07:44:52,700 --> 07:44:54,500
So there is a parameter

10681
07:44:54,500 --> 07:44:57,900
that we can set, the TTL; the cleanup
will get triggered along

10682
07:44:57,900 --> 07:45:00,900
with the running jobs
and, in between,

10683
07:45:00,900 --> 07:45:04,000
it's going to write the data
results to the disk,

10684
07:45:04,000 --> 07:45:07,155
or clean unnecessary data,
or clean the RDDs

10685
07:45:07,155 --> 07:45:08,600
that are not being used.

10686
07:45:08,600 --> 07:45:09,800
The least-used RDD:

10687
07:45:09,800 --> 07:45:10,987
it will get cleaned,

10688
07:45:10,987 --> 07:45:14,800
and it will keep the metadata as
well as the memory clean. What are

10689
07:45:14,800 --> 07:45:17,800
the various levels
of persistence in Apache Spark?

10690
07:45:17,800 --> 07:45:20,200
When you say data
should be stored in memory,

10691
07:45:20,200 --> 07:45:23,000
it can be at different levels;
you can persist it

10692
07:45:23,000 --> 07:45:27,100
so it can be in memory only,
or memory and disk, or disk only,

10693
07:45:27,200 --> 07:45:30,500
and when it is getting stored
we can ask it to store it

10694
07:45:30,500 --> 07:45:31,800
in a serialized form.

10695
07:45:31,900 --> 07:45:35,300
So the reason why we may store
or persist is this:

10696
07:45:35,303 --> 07:45:36,996
I want this particular

10697
07:45:37,100 --> 07:45:40,200
RDD, this form
of RDD, a little later,

10698
07:45:40,200 --> 07:45:42,038
for reuse, so I can read it

10699
07:45:42,038 --> 07:45:45,200
back. Maybe I may not need
it immediately,

10700
07:45:45,400 --> 07:45:48,477
so I don't want it to keep
occupying my memory.

10701
07:45:48,477 --> 07:45:50,400
I'll write it to the hard disk

10702
07:45:50,400 --> 07:45:52,700
and I'll read it back
whenever there is a need.
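
[Editor's sketch] The persistence levels described above (memory only, memory and disk, disk only, stored in serialized form) can be illustrated with a small plain-Python stand-in. This is not Spark code: the class TinyCache, its level names, and the memory_slots budget are hypothetical names made up for illustration. In Spark itself the corresponding choices are StorageLevel constants such as MEMORY_ONLY, MEMORY_AND_DISK and DISK_ONLY passed to persist().

```python
import os
import pickle
import tempfile


class TinyCache:
    """Toy stand-in for persistence levels; not Spark's implementation."""

    LEVELS = ("memory_only", "memory_and_disk", "disk_only")

    def __init__(self, level, memory_slots=2):
        if level not in self.LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.level = level
        self.memory_slots = memory_slots    # hypothetical memory budget, in partitions
        self.memory = {}                    # partition id -> serialized bytes
        self.disk_dir = tempfile.mkdtemp()  # spill location on the local disk

    def _disk_path(self, pid):
        return os.path.join(self.disk_dir, f"part-{pid}.pkl")

    def put(self, pid, rows):
        blob = pickle.dumps(rows)           # keep the data in serialized form
        fits = len(self.memory) < self.memory_slots
        if self.level != "disk_only" and fits:
            self.memory[pid] = blob
            return "memory"
        if self.level == "memory_only":
            return "dropped"                # would have to be recomputed later
        with open(self._disk_path(pid), "wb") as f:
            f.write(blob)
        return "disk"

    def get(self, pid):
        if pid in self.memory:
            return pickle.loads(self.memory[pid])
        path = self._disk_path(pid)
        if os.path.exists(path):
            with open(path, "rb") as f:     # read it back from disk when needed
                return pickle.load(f)
        return None                         # caller must recompute


cache = TinyCache("memory_and_disk", memory_slots=1)
placements = [cache.put(i, list(range(i, i + 3))) for i in range(3)]
restored = cache.get(2)
```

With a budget of one slot, the first partition stays in memory and the later ones spill to disk, yet all of them can still be read back, which is the trade-off the speaker describes.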

10703
07:45:52,700 --> 07:45:55,300
I'll read it back.
Next question:

10704
07:45:55,300 --> 07:45:58,069
what do you understand
by SchemaRDD?

10705
07:45:58,200 --> 07:46:01,900
So SchemaRDD is used
within Spark SQL.

10706
07:46:01,900 --> 07:46:05,300
So the RDD will have the meta
information built into it.

10707
07:46:05,300 --> 07:46:07,919
It will have the schema
also, very similar to

10708
07:46:07,919 --> 07:46:10,642
what we have as a database
schema: the structure

10709
07:46:10,642 --> 07:46:11,976
of the particular data.

10710
07:46:11,976 --> 07:46:14,994
And when I have a structure, it
will be easy for me

10711
07:46:14,994 --> 07:46:16,081
to handle the data.

10712
07:46:16,081 --> 07:46:19,100
So the data and the structure
will exist together,

10713
07:46:19,100 --> 07:46:20,360
and the SchemaRDD

10714
07:46:20,360 --> 07:46:20,550
now

10715
07:46:20,550 --> 07:46:22,100
is called a DataFrame

10716
07:46:22,100 --> 07:46:25,009
within Spark, and the DataFrame
term is very popular

10717
07:46:25,009 --> 07:46:27,616
in languages like R
and other languages.

10718
07:46:27,616 --> 07:46:28,700
It's very popular.

10719
07:46:28,700 --> 07:46:31,700
So it's going to have the data
and the meta information

10720
07:46:31,700 --> 07:46:34,700
about that data, saying
what columns and structure it has.
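
[Editor's sketch] The idea of data and its structure living together can be shown with a small plain-Python stand-in. This is not the Spark DataFrame API: TinyFrame, column() and where() are hypothetical names used only to illustrate how a schema lets you validate rows and address data by column name.

```python
from dataclasses import dataclass


@dataclass
class TinyFrame:
    """Toy stand-in: rows plus the schema that describes them."""
    schema: list  # [(column name, Python type), ...]
    rows: list    # list of tuples, one value per column

    def __post_init__(self):
        # Reject rows that do not match the declared structure.
        for row in self.rows:
            if len(row) != len(self.schema):
                raise ValueError(f"wrong number of fields: {row}")
            for value, (name, col_type) in zip(row, self.schema):
                if not isinstance(value, col_type):
                    raise TypeError(f"{name} expects {col_type.__name__}: {value!r}")

    def _index(self, name):
        return [n for n, _ in self.schema].index(name)

    def column(self, name):
        # The schema lets us address data by column name, not by position.
        i = self._index(name)
        return [row[i] for row in self.rows]

    def where(self, name, predicate):
        # Keep only the rows whose value in the named column passes the test.
        i = self._index(name)
        return TinyFrame(self.schema, [r for r in self.rows if predicate(r[i])])


people = TinyFrame(
    schema=[("name", str), ("age", int)],
    rows=[("Asha", 34), ("Ravi", 19), ("Mei", 27)],
)
adult_names = people.where("age", lambda a: a >= 21).column("name")
```

Because the schema travels with the rows, a filter can be written against the column name "age" and malformed rows are rejected up front, which is the convenience the speaker attributes to having a structure.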

10721
07:46:34,700 --> 07:46:36,300
Explain a scenario

10722
07:46:36,300 --> 07:46:38,656
where you will be
using Spark Streaming.

10723
07:46:38,656 --> 07:46:41,200
Say you want to do
a sentiment analysis

10724
07:46:41,200 --> 07:46:44,200
of tweets, so there
the data will be streamed.

10725
07:46:44,400 --> 07:46:49,200
So we will use Flume, or that sort of tool,
to harvest the information

10726
07:46:49,300 --> 07:46:52,700
from Twitter and feed it
into Spark Streaming.

10727
07:46:52,700 --> 07:46:56,300
It will extract or identify
the sentiment of each

10728
07:46:56,300 --> 07:46:58,300
and every tweet and mark it

10729
07:46:58,300 --> 07:47:00,899
as positive
or negative, and accordingly

10730
07:47:00,899 --> 07:47:02,900
the data will be
structured data:

10731
07:47:02,900 --> 07:47:03,700
the tweet ID,

10732
07:47:03,700 --> 07:47:05,742
whether it is positive
or negative, maybe the

10733
07:47:05,742 --> 07:47:06,856
percentage of positive

10734
07:47:06,856 --> 07:47:09,088
and percentage of negative
sentiment. Store it

10735
07:47:09,088 --> 07:47:10,500
in some structured form.

10736
07:47:10,500 --> 07:47:14,111
Then you can leverage Spark
SQL and do grouping

10737
07:47:14,111 --> 07:47:16,403
or filtering based
on the sentiment,

10738
07:47:16,403 --> 07:47:19,587
and maybe I can use
a machine learning algorithm to find

10739
07:47:19,587 --> 07:47:22,107
what drives a
particular tweet to be

10740
07:47:22,107 --> 07:47:23,500
on the negative side.

10741
07:47:23,500 --> 07:47:26,700
Is there any similarity between
all these negative-sentiment

10742
07:47:26,700 --> 07:47:28,812
tweets? Maybe they are specific

10743
07:47:28,812 --> 07:47:32,700
to a product, a specific time
when the tweet was tweeted,

10744
07:47:32,700 --> 07:47:34,421
or a specific region

10745
07:47:34,421 --> 07:47:36,900
that the tweet was
tweeted from. That analysis

10746
07:47:36,900 --> 07:47:40,194
could be done by leveraging
the MLlib of Spark.

10747
07:47:40,194 --> 07:47:43,700
So MLlib, Streaming, and Core
are all going to work together.

10748
07:47:43,700 --> 07:47:45,200
All these are like different

10749
07:47:45,200 --> 07:47:48,500
offerings available to
solve different problems.
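
[Editor's sketch] The sentiment scenario described above (label each tweet, keep the result in structured form, then group or filter by sentiment) can be outlined in plain Python. This is only a stand-in for the flow, not Spark Streaming, Flume, Spark SQL or MLlib code. The word lists, the record fields, and the function names are all hypothetical, and a real job would use a trained model instead of a lexicon.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon; a real pipeline would use a trained classifier.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"hate", "slow", "broken"}


def label(text):
    """Mark a tweet as positive or negative by counting lexicon hits."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score >= 0 else "negative"


def process_batch(tweets):
    """Turn one batch of raw tweets into structured records."""
    return [
        {"product": t["product"], "region": t["region"], "sentiment": label(t["text"])}
        for t in tweets
    ]


def sentiment_share(records):
    """Percentage of positive and negative tweets in the batch."""
    counts = Counter(r["sentiment"] for r in records)
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}


def negatives_by(records, key):
    """Group the negative tweets by a column such as product or region."""
    grouped = defaultdict(int)
    for r in records:
        if r["sentiment"] == "negative":
            grouped[r[key]] += 1
    return dict(grouped)


batch = [
    {"text": "I love the new phone", "product": "phone", "region": "EU"},
    {"text": "The app is slow and broken", "product": "app", "region": "US"},
    {"text": "I hate how slow the app is", "product": "app", "region": "EU"},
    {"text": "Great camera", "product": "phone", "region": "US"},
]
records = process_batch(batch)
share = sentiment_share(records)
by_product = negatives_by(records, "product")
```

Grouping the negative records by product or region is the kind of question the speaker suggests answering with Spark SQL, and finding what drives the negative tweets is where a machine learning library would take over.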

10750
07:47:48,600 --> 07:47:51,100
So with this we are coming
to the end of this interview

10751
07:47:51,100 --> 07:47:53,100
question discussion on Spark.

10752
07:47:53,100 --> 07:47:54,465
I hope you all enjoyed.

10753
07:47:54,465 --> 07:47:56,913
I hope it was a constructive
and useful one.

10754
07:47:56,913 --> 07:47:59,600
More information
about Edureka is available

10755
07:47:59,600 --> 07:48:02,183
on this website, edureka.co.
All the best,

10756
07:48:02,183 --> 07:48:05,900
and keep visiting the website
for blogs and latest updates.

10757
07:48:05,900 --> 07:48:07,000
Thank you folks.

10758
07:48:07,500 --> 07:48:10,400
I hope you have enjoyed
listening to this video.

10759
07:48:10,400 --> 07:48:12,450
Please be kind enough to like it

10760
07:48:12,450 --> 07:48:15,600
and you can comment any
of your doubts and queries

10761
07:48:15,600 --> 07:48:17,078
and we will reply to them

10762
07:48:17,078 --> 07:48:20,923
at the earliest. Do look out
for more videos in our playlist

10763
07:48:20,923 --> 07:48:24,105
and subscribe to the Edureka
channel to learn more.

10764
07:48:24,105 --> 07:48:25,100
Happy learning.
